What is different between encoding and font - encoding

Encoding is maping that gives characters or symbols a unique value.
If a character is not present in encoding no matter what font you use it won't display correct fonts
Like Lucida console, arial or terminal
But problem is terminal font is showing line draw characters but other font is not showing line draw characters
My question is why terminal is behaving different to other font
Plz note
Windows 7
Locale English

For the impatient, the relevant link is at the bottom of this answer.
Encoding is maping that gives characters or symbols a unique value.
No, that are the specifics of a character-set, which maps certain characters to code points (using the Unicode terminology). Lets ignore the above for now.
If a character is not present in encoding no matter what font you use it won't display correct fonts Like Lucida console, arial or terminal
Font formats map Unicode code points to glyphs. Not all code points may be mapped for specific fonts - somebody has to create all these symbols. Again, lets ignore this.
Not all binary encodings may map to code points within a certain character set; this is possibly what you mean.
But problem is terminal font is showing line draw characters but other font is not showing line draw characters
Your terminal seems to operate on a different character set, probably the "OEM" or "IBM PC" character set instead of a Unicode compliant character set or Windows-1252 / ISO 8859-1 / Latin.
If it is the latter than you are out of luck unless you can set your output-terminal to another character set, as Windows-1252 doesn't support the box drawing characters at all.
Solutions:
If possible try and set the output to OEM / IBM PC character set.
If it is Unicode you can try and convert the output to Unicode: read it in (decode it) using the OEM character set and then re-encode it using the box drawing subset.

Related

How does this font manage to display even in plain text?

I came across a piece of text that displays in a mystery font even when you view source in plain text: 𝓦𝓸𝓸𝓭
The word 'Wood' above appears, at least in Chrome, as a sort of caligraphic font when pasted in to Notepad or even the Google search bar.
Have tried to see if its base64 encoded characters, or quoted printable etc
𝓦𝓸𝓸𝓭
Can anyone identify how its done? Can it be done with a different font? Is it cross browser compatible?
Those characters are not being shown in a different font. They're in the same font as the rest of the page source.
The reason why they look strange is that they are not the ordinary letters 'W', 'o' and 'd' represented by the character values 0x57, 0x6f and 0x64. These characters are from the "Mathematical Alphanumeric Symbols" section of the font. Specifically they are the "Mathematical Bold Script Capital W", the "Mathematical Bold Script Small O" and the "Mathematical Bold Script Small D" characters represented by the values 0x1d4e6, 0x1d4f8 and 0x1d4ed. See https://unicode-table.com/en/blocks/mathematical-alphanumeric-symbols/ for a table of the characters in that section.
There's a good chance that any modern browser would show those characters just as you're seeing them. It comes down to whether the font that the browser uses to present the page includes glyphs for those character values.

Where are the unicode characters on the disk and what's the mapping process?

There are several unicode relevant questions has been confusing me for some time.
For these reasons as follow I think the unicode characters are existed on disk.
Execute echo "\u6211" in terminal, it will print the glyph corresponding to the unicode code point U+6211.
There's a concept of UCD (unicode character database), and We can download it's latest version. UCD latest
Some new version unicode characters like latest emojis can not display on my mac until I upgrade macOS version.
So if the unicode characters does existed on the disk , then :
Where is it ?
How can I upgrade it ?
What's the process of mapping the unicode code point to a glyph ?
If I use a specific font, then what's the process of mapping the unicode code point to a glyph ?
If not, then what's the process of mapping the unicode code point to a glyph ?
It will very appreciated if someone could shed light on these problems.
Execute echo "\u6211" in terminal, it will print the glyph corresponding to the unicode code point U+6211.
That's echo -e in bash.
› echo "\u6211"
\u6211
› echo -e "\u6211"
我
Where is it ?
In the font file.
Some new version unicode characters like latest emojis can not display on my mac until I upgrade macOS version.
How can I upgrade it ?
Installing/upgrading a suitable font with the emojis should be enough. I don't have macOS, so I cannot verify this.
I use "Noto Color Emoji" version 2.011/20180424, it works fine.
What's the process of mapping the unicode code point to a glyph ?
The application (e.g. text editor) provides the font rendering subsystem (Quartz? on macOS) with Unicode text and a font name. The font renderer analyses the codepoints of the text and decides whether this is simple text (e.g. Latin, Chinese, stand-alone emojis) or complex text (e.g. Latin with many marks, Thai, Arabic, emojis with zero-width joiners). The renderer finds the corresponding outlines in the font file. If the file does not have the required glyph, the renderer may use a similar font, or use a configured fallback font for a poor substitute (white box, black question mark etc.). Then the outlines undergo shaping to compose a complex glyph and line-breaking. Finally, the font renderer hands off the result to the display system.
Apart from the shaping, very little of this has to do with Unicode or encoding. Font rendering already used to work that way before Unicode existed, of course font files and rendering was much simpler 30 years ago. Encoding only matters when someone wants to load or save text from an application.
Summary: investigate
Truetype/Opentype font editing software so you can see what's contained in the files
font renderers, on Linux look at the libraries pango and freetype.
Generally speaking, operating system components that use text use the Unicode character set. In particular, font files use the Unicode character set. But, not all font files support all the Unicode codepoints.
When a codepoint is not supported by one font, the system might fallback to another that does. This is particularly true of web browsers. But ultimately if the codepoint is not supported, an unfilled rectangle is rendered. (There is no character for that because it's not a character. In fact, if you were able to copy and paste it as text, it should be the original character that couldn't be rendered.)
In web development, the web page can either supply or give the location of fonts that should work for the codepoints it uses.
Other programs typically use the operating system's rendering facilities and therefore the fonts available through it. How to install a font in an operating system is not a programming question (unless you are including a font in an installer for your program). For more information on that, you could see if the question fits with the Ask Different (Apple) Stack Exchange site.

What character is this:?

EDIT
While posting the question, character I ask for was shown well to me, but after postig it does not show up anymore. As it does not appear, please look up in original site
EDIT2
I looked for Unicode chars associated with "alien", and found no matching ones. Here is how they are compared side by side:
I found, that some texts inside my database contain character like . I am not sure, how it would rendered with different fonts and environments, so here is the image, how I see it:
I tried to identify it with different ways. For example, when I paste it into Sublime Text, it automatically shows as control character <0x85>. When I tried to identify it in different unicode-detectors (http://www.babelstone.co.uk/Unicode/whatisit.html, https://unicode-table.com/en/, https://unicode-search.net/unicode-namesearch.pl), their conclusion is pretty match the same:
Uni­code code point char­acter U+0085
UTF-8 en­co­ding c2 85 hexa­decimal
194 133 deci­mal
0302 0205 octal
Uni­co­de char­ac­ter name <control>
Uni­co­de 1.0 char­act­er name (de­pre­ca­ted) NEXT LINE (NEL)
https://unicode-search.net/unicode-namesearch.pl
also included this information
HTML en­co­ding … … hexa­decimal
… … deci­mal
which gave me some vague hint, how it was possible, that … become ``. But this is not main problem here.
My question is: how is possible, that control character is shown up like this and what is the actual glyph used to represent it?
I tried to sketch into http://shapecatcher.com/ to identify it but without success. I did not find such a glyph in any Unicode table.
The alien symbol is not a Unicode character; but is in Microsoft's Webdings font, with character code 0x85. Running Start > Run > charmap, then selecting Webdings from the Font drop list, opens this window:
If I click that alien character in the leftmost column, the message Character Code : 0x85 is shown at the bottom of the window.
I can even copy that character from the Character Map and paste it into Microsoft Wordpad:
The WebDings symbols were included in Unicode Release 7: Pictographic symbols (including many emoji), geometric symbols, arrows, and ornaments originating from the Wingdings and Webdings sets. Therefore you would expect the alien symbol to also be in Unicode. However, I don't think the version of Webdings that was used included that alien symbol, since Windows 10 also has a ttf file for Webdings (version 5.01), and it also does not include the alien symbol:
So presumably what originally caught your attention was some text being rendered with an older version of the Webdings font which included that alien symbol.
The glyph is 👽 U+1F47D EXTRATERRESTRIAL ALIEN. I don't know why your system misrenders a control character.

Convert non english characters into Unicode (UTF-8)

I copied large amount of text from another system to my PC. When I viewed the text in my PC, it looked weird. So I copied all the fonts from the other PC and installed them in mine too. Now the text looks okay, but actually it seems that is not in Unicode. For example, if I copy the text and paste in another UTF-8 supported editor such as Notepad++, I get English characters ("bgah;") only like shown below.
How to convert this whole text into unicode text, like the one below. So I can copy the text and paste anywhere else.
பெயர்
The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil
I need this conversion to be done, so I can copy them into database tables.
'Ja-01' is a font with a custom 'visual encoding'.
That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.
This is always to be avoided, because by storing the content as "bgah;" you lose the ability to search and process it as real Tamil, but this approach was common in the pre-Unicode days especially for less-widespread scripts without mature encoding standards. This application probably predates widespread use of TSCII.
Because it is a custom encoding not shared by any other font, it is very unlikely you will be able to find a tool to convert content in this encoding to proper Unicode characters. It does not appear to be any standard character ordering, so you will have to look at the font (eg in charmap.exe) and note down every character, find the matching character in Unicode and map between them.
For example here's a trivial Python script to replace characters in a file:
mapping= {
u'a': u'\u0BAF', # Tamil letter Ya
u'b': u'\u0BAA', # Tamil letter Pa
u'g': u'\u0BC6', # Tamil vowel sign E (combining)
u'h': u'\u0BB0', # Tamil letter Ra
u';': u'\u0BCD', # Tamil sign virama (combining)
# fill in the rest of the mapping information here!
}
with open('ja01data.txt', 'rb') as fp:
data= fp.read().decode('utf-8')
for char in mapping:
data= data.replace(char, mapping[char])
with open('utf8data.txt', 'wb') as fp:
fp.write(data.encode('utf-8'))
The font you found is getting you into trouble. The actual cell text is "bgah;", it gets rendered to பெயர் because you found a font that can work with 8-bit non-Unicode characters. So reading it or pasting it into Notepad++ is going to produce "bgah;" since that's the real text. It can only ever be rendered properly again by forcing the program that displays the string to use that same font.
Ditch the font and enter Unicode so it looks like this:
"bgah" looks like a Baamini based system, which is pre-unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.
As the others mentioned, it looks like a custom visual encoding that mimics the performance of a foreign script while maintaining ASCII encoding.
Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml
Let me know if this works. If not, I can ask around and get something for you.
You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check out whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode” but they say that only TSCII is supported for now.

FreeType 2 - Unicode Character Codes?

The documentation for FreeType2 says that the default character map used is the Unicode map... However, when I attempt to retrieve character code for Unicode 'T', it gives me Unicode 'Z' using:
glyph_index = FT_Get_Char_Index(face, text[n]);
What I really need is a way to find out how many glyphs are in the font face and what their Unicode value maps to per each one. Is there any way to do this. I've tried almost every FreeType function and can't get good results.
Thanks
I know this is old but...
What you ask is not possible. There are glyphs that don't match to any Unicode codepoint, and there are Unicode codepoints that map to multiple glyphs, depending on the neighboring glyph. For example, the "ff" in many fonts is a special glyph to make typesetting work better. There is no "ff" codepoint in Unicode. It's up to your layout system to decide to use an "ff" glyph or not.
However, if you're asking for the 'T' character and getting a 'Z' glyph index, then there are probably issues with your font.