Unicode characters messed up after linebreak

Certain combinations of Unicode characters seem to be problematic. I'll show you what I mean using Notepad++.
Create a new text file in Notepad++ and change the encoding to UTF-8 (BOM doesn't matter).
Copy and paste the following four arrows: ↑↓↙↘. This should look fine (see first image below).
Now insert a newline after the second arrow (Windows/Unix doesn't matter). Now the first line still looks fine, but the arrows in the second line are replaced by placeholder boxes (see second image below).
Saving and reopening makes no difference. Still boxes in the second line. Remove the linebreak, and everything looks fine again.
This problem isn't exclusive to Notepad++. Other programs also show garbage when loading the text file with a linebreak. Surprisingly, the standard Windows Notepad displays it just fine.
This is the working file, once in hex and once within Notepad++:
E2 86 91 E2 86 93 E2 86 99 E2 86 98
This is the broken file. Notice all that's different is the added linebreak (0D 0A).
E2 86 91 E2 86 93 0D 0A E2 86 99 E2 86 98
Can anybody shed some light on what's happening here?
Edit: I'm writing a program that creates output in a text format. I stumbled upon the problem when several text editors wouldn't display my program's output correctly, so I first assumed there was something wrong with my program. As it stands, its output is just fine. So the real question is:
Is there a way to change the second (broken) example so that it will display correctly in your typical editor?

This is a font problem that exposes some bugs or deficiencies in text editors. One might actually ask why e.g. Notepad++ shows “↙↘” at all when it is using Courier New (which I think is its default font). That font, like many others, does not contain those characters at all.
Looking at the sample in the question you can probably see that in “↑↓↙↘”, the first two characters are in different style from the other two. The reason is that they are displayed in two different fonts. (I see them in Arial and in DejaVu Sans. Your mileage may vary, depending on fonts installed in your system and your browser’s fallback font list.)
Similar things happen e.g. in Notepad++ and Notepad. When the primary font being used does not contain all the characters in the text, the program uses some fallback font(s). This might be hard-wired in the program code, or it might be user-settable.
For some reason, in Notepad++, the font fallback mechanism fails in some situations. It also happens if you just delete the first two characters, or initially enter just “↙↘”. Apparently, what precedes those characters on the same line affects the font selection mechanism. You might consider submitting a bug report, but it might be classified as a feature, not a bug. After all, asking a program to render characters that do not appear in the font it has been set to use might cause general failure, rather than just a failure in some cases.
The solution is that when using a text editor to view data, the editor should be set to use a font that contains all the characters appearing in the text. See a list of fonts supporting “↙” (not exhaustive, but probably covers rather well the fonts you can expect a normal computer to have installed). In a text editor, you might wish to use a monospace font; in that case, DejaVu Sans Mono might be adequate (unless there are other relatively uncommon special characters – the font has only 3,310 glyphs).
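To double-check that the bytes themselves are fine and only the rendering differs, both hex dumps from the question can be decoded and listed by code point. This is a minimal Python sketch; the helper name describe is made up for illustration.

```python
import unicodedata

# Hex dumps copied from the question.
WORKING = "E2 86 91 E2 86 93 E2 86 99 E2 86 98"
BROKEN = "E2 86 91 E2 86 93 0D 0A E2 86 99 E2 86 98"


def describe(hex_dump):
    """Decode a hex dump as UTF-8 and return (code point, name) pairs."""
    text = bytes.fromhex(hex_dump).decode("utf-8")  # raises on invalid UTF-8
    # Control characters such as CR and LF have no name, hence the default.
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<control>")) for ch in text]


for label, dump in (("working", WORKING), ("broken", BROKEN)):
    print(label)
    for code_point, name in describe(dump):
        print(" ", code_point, name)
```

Both dumps decode without error and contain the same four arrows; the second one only adds CR and LF. That fits the explanation that the placeholder boxes come from font selection in the editor, not from the file.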

Related

I need to remove a specific unicode in my existing subtitle text file

I work on subtitles and I have this Arabic file. When I open it in Notepad, right-click, and select Show Unicode control characters, it shows some weird characters at the left of every line. I tried many ways to remove them but failed. The tools I tried:
Notepad++
Subtitle Edit
Excel
Word
288
00:24:41,960 --> 00:24:43,840
‫أتعلم، قللنا من شأنك فعلاً‬
289
00:24:44,000 --> 00:24:47,120
‫كان علينا تجنيدك لتكون جاسوساً‬
‫مكان (كاي سي)‬
290
00:24:47,280 --> 00:24:51,520
‫لا تعلمون كم أنا سعيد‬
‫لسماع ذلك‬
291
00:24:54,800 --> 00:24:58,160
‫لا تقلق، سيستيقظ نشيطاً غداً‬
292
00:24:58,320 --> 00:25:00,800
‫ولن يتذكر ما حصل‬
‫في الساعات الـ٦‬
The control characters do not show up in the excerpt above. The character is U+202B, which is displayed as a ¶ sign; after googling it, I think it's called PILCROW.
The issue is that the subtitles do not display correctly in the PS4 app.
I need this PILCROW sign to go away. The issue in the file can be seen with this website: https://www.soscisurvey.de/tools/view-chars.php

The PILCROW ¶ is used by various software and publishers to show the end of a line in a document. The actual Unicode character does not exist in your file so you can't get rid of it.

The Unicode characters in these lines are 'RIGHT-TO-LEFT EMBEDDING' (code \u202b) and 'POP DIRECTIONAL FORMATTING' (code \u202c). They mark the enclosed text as something that should be rendered right-to-left instead of in the Western left-to-right direction.
These characters are hints to the application displaying the text; they do not reverse the text themselves. So they can likely be removed without compromising how the text is displayed.
This is a programming Q&A site, but you did not indicate any programming language you are familiar with, even just enough to run a program, so it is hard to know how to give an answer that suits you.
Python can be used to write a small program that filters such characters out of a file, but I am not willing to write a full-fledged GUI program or a web app just as an answer here.
A command-line program that just filters out a few characters is another matter, as it is only a few lines of code.
Store the following listing as a file named, say, "fixsubtitles.py", then open a terminal ("cmd" if you are on Windows), type python3 fixsubtitles.py \path\to\subtitlefile.txt and press Enter.
That is, of course, after installing the Python 3 runtime from http://python.org (on Mac or Linux it is usually already installed).
import sys
from pathlib import Path

encoding = "utf-8"
# Translation table that deletes RLE (U+202B) and PDF (U+202C).
remove_set = str.maketrans("", "", "\u202b\u202c")

if len(sys.argv) < 2:
    print("Usage: python3 fixsubtitles.py [filename]", file=sys.stderr)
    sys.exit(1)

path = Path(sys.argv[1])
data = path.read_text(encoding=encoding)
# Overwrites the file in place, so keep a backup copy.
path.write_text(data.translate(remove_set), encoding=encoding)
print("Done")
You may need to adjust the encoding, as Windows does not always use UTF-8. The files can be in, for example, "cp1256"; if you get a Unicode error when running the program, try that in place of "utf-8". You may also need to add more characters to the set of characters to be removed; the tool you linked in the question should show you other such characters, if any. Other than that, the program above should work.
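To find out which other invisible characters a file contains before extending the removal set, one option is to count everything in Unicode general category Cf (format characters). This is only a sketch: the function name find_format_chars and the sample string are made up for illustration, and Cf also covers characters such as the zero-width joiner that some scripts need, so the result should be reviewed rather than deleted wholesale.

```python
import unicodedata
from collections import Counter


def find_format_chars(text):
    """Count characters in Unicode category Cf (invisible format controls)."""
    return Counter(ch for ch in text if unicodedata.category(ch) == "Cf")


# Hypothetical sample: one line wrapped in RLE ... PDF, as in the question.
sample = "\u202bمرحبا\u202c\n"
for ch, count in find_format_chars(sample).items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')} x{count}")
```

For a real file, the text would come from path.read_text(encoding=encoding) as in the listing above.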

What character is this:?

EDIT
While posting the question, the character I am asking about displayed fine for me, but after posting it no longer shows up. Since it does not appear, please look it up on the original site.
EDIT2
I looked for Unicode chars associated with "alien", and found no matching ones. Here is how they are compared side by side:
I found that some texts in my database contain a character like . I am not sure how it would be rendered with different fonts and environments, so here is an image of how I see it:
I tried to identify it in different ways. For example, when I paste it into Sublime Text, it automatically shows as the control character <0x85>. When I tried to identify it with different Unicode detectors (http://www.babelstone.co.uk/Unicode/whatisit.html, https://unicode-table.com/en/, https://unicode-search.net/unicode-namesearch.pl), their conclusions are pretty much the same:
Unicode code point character U+0085
UTF-8 encoding c2 85 hexadecimal
194 133 decimal
0302 0205 octal
Unicode character name <control>
Unicode 1.0 character name (deprecated) NEXT LINE (NEL)
https://unicode-search.net/unicode-namesearch.pl
also included this information
HTML encoding … hexadecimal
… decimal
which gave me a vague hint as to how … could become ``. But this is not the main problem here.
My question is: how is it possible that a control character shows up like this, and what is the actual glyph used to represent it?
I tried to sketch into http://shapecatcher.com/ to identify it but without success. I did not find such a glyph in any Unicode table.

The alien symbol is not a Unicode character, but it is in Microsoft's Webdings font, with character code 0x85. Running Start > Run > charmap, then selecting Webdings from the Font drop-down list, opens this window:
If I click that alien character in the leftmost column, the message Character Code : 0x85 is shown at the bottom of the window.
I can even copy that character from the Character Map and paste it into Microsoft Wordpad:
The WebDings symbols were included in Unicode Release 7: Pictographic symbols (including many emoji), geometric symbols, arrows, and ornaments originating from the Wingdings and Webdings sets. Therefore you would expect the alien symbol to also be in Unicode. However, I don't think the version of Webdings that was used included that alien symbol, since Windows 10 also has a ttf file for Webdings (version 5.01), and it also does not include the alien symbol:
So presumably what originally caught your attention was some text being rendered with an older version of the Webdings font which included that alien symbol.

The glyph is 👽 U+1F47D EXTRATERRESTRIAL ALIEN. I don't know why your system misrenders a control character.
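The detector output quoted in the question can be reproduced in a few lines, which also shows why the byte value 0x85 is ambiguous: as the code point U+0085 it is an unnamed control character, but as a single byte in Windows-1252 it is the horizontal ellipsis. A small Python sketch, with the helper name summarize made up for illustration:

```python
import unicodedata

NEL = "\u0085"  # the code point reported by the detectors


def summarize(ch):
    """Return (UTF-8 bytes as hex, general category, name or placeholder)."""
    return (
        ch.encode("utf-8").hex(),
        unicodedata.category(ch),
        unicodedata.name(ch, "<no name>"),  # control characters have no name
    )


print(summarize(NEL))
# The single byte 0x85 read as Windows-1252 is a different, visible character.
print(repr(bytes([0x85]).decode("cp1252")))
```

Which glyph finally appears for 0x85 depends on the font and on how the renderer maps the value, which is consistent with the symbol-font explanation given above.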

Miscellaneous characters in xmgrace

xmgrace is wonderful, but it has some problems when dealing with miscellaneous characters.
How can I make the script small l ($\ell$ in LaTeX) in xmgrace?

I believe the only way to do this is to specify a script-like system font. None of the standard ones are suitable so you will have to make sure that a suitable font is installed on your system.
You can change to any font by enclosing the name in
\f{}
e.g.
\f{Symbol}
or
\f{Century-Schoolbook-L-Bold_italic}
You can see a list of the available fonts (and their labels) by going to the Font tool in the Window menu of the xmgrace GUI.
After typing the special character you can return to your original font in a similar way, or by using \0 to get back to the default font 0.

How do I add a new Arabic vowel-sign in the PUA area of a font?

I am using Ubuntu 14.04, with FontForge compiled from the Git repo as of 31
July.
I'm trying to add a vowel-sign to an Arabic font, Graph, by Future Soft Egypt:
http://openfontlibrary.org/en/font/graph
I have added glyphs where the Unicode code-point already exists (eg peh,
U+067E), and that works fine. I am now trying to add a vowel sign where no
Unicode code-point exists - it is a "damma with tail", used by some writers in
Swahili to mean "o".
I decided to put it in the PUA at U+E909, and copied the font's damma (U+064F)
and added a tail:
http://kevindonnelly.org.uk/swahili/images/dammas.png
I generated the font, and set up the keyboard to emit that character.
The glyph comes up OK, but there are two problems, as can be seen here:
http://kevindonnelly.org.uk/swahili/images/output.png
showing at top "bubu", using the original damma, and at bottom "bobo", using
the new damma-with-tail.
(1) The damma-with-tail is too far to the left, even though the anchor points
in FF have not been moved.
(2) Worse, the damma-with-tail means that only the isolated versions of the
consonant glyphs get used - in the second line the two bs should be joined, as
in the first line.
I'm not sure whether this is a function of using the PUA, or whether it's due
to my missing some step I need to take in FF (eg the Encoding -> Add Encoding
Slots that needs to be done for the consonants), but if anyone could shed some
light on how to fix the two problems, I'd be very grateful.

What is this unicode invisible character?

While trying to parse some unicode text strings, I'm hitting an invisible character that I can't find any definition for. If I paste it in to a text editor and show invisibles, I can see that it looks like a bullet point (• alt-8), and by copy/pasting them, I can see it has an effect like a space or tab, but it's none of those.
I need to test for it, something like...
if(uniChar == L'\t')
But of course I need to provide something to match to.
It has bytes 0xc2 0xa0 in UTF-8.
If no-one has a definition, is there any devious way to test for something I can't define!?
(I happen to be using NSStrings in Objective-C, OSX, Xcode, but I don't think that has any bearing.)

Bytes C2 A0 in UTF-8 encode U+00A0 ɴᴏ-ʙʀᴇᴀᴋ sᴘᴀᴄᴇ, which can be used, for example, to display combining marks in isolation. It is &nbsp; as a named HTML entity. It is almost the same as a U+0020 sᴘᴀᴄᴇ, except that it prevents line breaks before or after it and acts as a numerical separator for bidirectional layout.
The dot you see when you ask a text editor to show invisibles just happens to be what glyph the text editor chose to display spaces. It does not mean the character in question is U+00B7 ᴍɪᴅᴅʟᴇ ᴅᴏᴛ, which is definitely not invisible.
In code, if you have it as a unichar, you can compare it to L'\x00A0'.
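The same identification can be done interactively before hard-coding the comparison. This is a Python sketch rather than Objective-C, and the three-byte sample is made up for illustration:

```python
import unicodedata

raw = b"a\xc2\xa0b"          # 'a', the two mystery bytes, 'b'
text = raw.decode("utf-8")
ch = text[1]                 # the character the two bytes decode to

print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# It equals U+00A0, is not an ordinary space, but still counts as whitespace.
print(ch == "\u00a0", ch == " ", ch.isspace())
```

Because the character counts as whitespace but is not equal to U+0020, a test against a plain space or tab misses it; it needs its own comparison against U+00A0.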
