How does this font manage to display even in plain text? - encoding

I came across a piece of text that displays in a mystery font even when you view source in plain text: 𝓦𝓸𝓸𝓭
The word 'Wood' above appears, at least in Chrome, as a sort of calligraphic font when pasted into Notepad or even the Google search bar.
I have tried to see whether it's Base64-encoded characters, quoted-printable, etc.
𝓦𝓸𝓸𝓭
Can anyone identify how it's done? Can it be done with a different font? Is it cross-browser compatible?

Those characters are not being shown in a different font. They're in the same font as the rest of the page source.
The reason why they look strange is that they are not the ordinary letters 'W', 'o' and 'd' represented by the character values 0x57, 0x6f and 0x64. These characters are from the "Mathematical Alphanumeric Symbols" block of Unicode. Specifically, they are the "Mathematical Bold Script Capital W", the "Mathematical Bold Script Small O" and the "Mathematical Bold Script Small D" characters, represented by the code points 0x1d4e6, 0x1d4f8 and 0x1d4ed. See https://unicode-table.com/en/blocks/mathematical-alphanumeric-symbols/ for a table of the characters in that block.
There's a good chance that any modern browser would show those characters just as you're seeing them. It comes down to whether the font that the browser uses to present the page includes glyphs for those character values.
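One way to confirm this is to inspect the code points directly. The following is a minimal sketch using Python's standard unicodedata module; the variable names are only illustrative, and the NFKC step is an extra not mentioned in the answer, included to show that the styled letters can be folded back to plain ASCII.

```python
import unicodedata

# The styled "Wood" from the question, written as explicit code points.
text = "\U0001D4E6\U0001D4F8\U0001D4F8\U0001D4ED"

# Print each character's code point and official Unicode name.
for ch in text:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Compatibility normalization (NFKC) maps the styled letters to ordinary ones.
plain = unicodedata.normalize("NFKC", text)
print(plain)
```

No font is involved in any of these steps, which is why the styling survives copying into a plain-text editor.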

Related

What is the difference between an encoding and a font

An encoding is a mapping that gives characters or symbols a unique value.
If a character is not present in the encoding, it won't display correctly no matter what font you use,
such as Lucida Console, Arial or Terminal.
But the problem is that the Terminal font shows line-drawing characters while the other fonts do not.
My question is: why does Terminal behave differently from the other fonts?
Please note:
Windows 7
Locale English
For the impatient, the relevant link is at the bottom of this answer.
An encoding is a mapping that gives characters or symbols a unique value.
No, that describes a character set, which maps characters to code points (using the Unicode terminology). Let's ignore the above for now.
If a character is not present in the encoding, it won't display correctly no matter what font you use, such as Lucida Console, Arial or Terminal.
Font formats map Unicode code points to glyphs. Not every code point is mapped in every font, since somebody has to create all these symbols. Again, let's ignore this.
Not every byte sequence maps to a code point within a given character set; this is possibly what you mean.
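To make the distinction concrete, here is a small sketch in Python. The choice of character and the variable names are illustrative assumptions, not part of the original answer. It shows one code point encoded as different bytes in two encodings, and a byte sequence that is not valid in a given encoding.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)           # the character's position in the character set
utf8 = ch.encode("utf-8")      # two bytes in UTF-8
latin1 = ch.encode("latin-1")  # a single byte in ISO 8859-1

print(hex(code_point), utf8, latin1)

# Not every byte sequence is valid in every encoding.
try:
    b"\xff".decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)
```

Fonts do not appear here at all: they only matter later, when a code point has to be drawn as a glyph.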
But the problem is that the Terminal font shows line-drawing characters while the other fonts do not.
Your terminal seems to operate on a different character set, probably the "OEM" or "IBM PC" character set instead of a Unicode-compliant character set or Windows-1252 / ISO 8859-1 / Latin-1.
If it is the latter, then you are out of luck unless you can set your output terminal to another character set, as Windows-1252 doesn't support the box drawing characters at all.
Solutions:
If possible, try to set the output to the OEM / IBM PC character set.
If it is Unicode, you can try converting the output: read it in (decode it) using the OEM character set and then re-encode it in a Unicode encoding such as UTF-8, which covers the box drawing characters.
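The decode-then-re-encode step can be sketched as follows. This assumes the OEM character set is code page 437 (the original IBM PC code page); the actual OEM code page depends on the system locale, and the sample bytes are made up for illustration.

```python
# Bytes as an OEM (code page 437) program might emit them for the top of a box.
oem_bytes = b"\xc9\xcd\xcd\xbb"

# Decode with the OEM character set, then re-encode as UTF-8.
text = oem_bytes.decode("cp437")
utf8_bytes = text.encode("utf-8")
print(text)

# Windows-1252 has no box drawing characters, so encoding to it fails.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
print(cp1252_ok)
```

If the output looks wrong, the bytes were probably produced under a different OEM code page (for example 850), and the decode step needs that codec name instead.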

Display issue with diacritics for a phonetic alphabet

I need to write Unicode characters and diacritics in a web page. They are part of a phonetic alphabet designed for Romanist studies (the Bourciez alphabet). My problem is a display issue, I believe: the character codes are all correct in Unicode, but some diacritics are not displayed as expected.
Most notably, the 'COMBINING DOUBLE BREVE BELOW' (U+035C) does not display as expected: it appears not under the 2 letters to which it is supposed to apply, but under the last of those letters and the next character (another letter, or a space).
Here for instance, the combining diacritic should be under the first 2 "a" characters, but it is displayed under the 2nd and 3rd "a"; yet you can see that the combination has been applied to the first 2 "a"s, because they are displayed in smaller size than the normal "a"s:
[Image: result of combining double breve below]
I'm using fonts which have those characters (I tried Arial Unicode MS, Gentium, and Lucida Sans Unicode). They all have the same display issue.
Any idea how I can solve this issue?
I'm having trouble reproducing the problem. Connecting two characters with the breve diacritic seems to work for me. First I enter the first character of the pair, then the U+035C character, and finally the second; it displays as follows.
[Image: sample image]
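The character order described in this answer can be checked programmatically. The sketch below is only an illustration (the variable names are hypothetical): it builds the sequence as first letter, U+035C, second letter, and lists the resulting code points so the order in the page source can be compared against it.

```python
import unicodedata

DOUBLE_BREVE_BELOW = "\u035c"

# Order matters: first letter, then the double diacritic, then the second letter.
pair = "a" + DOUBLE_BREVE_BELOW + "a"
sample = pair + "a"  # a third, unmarked "a" for comparison

for ch in sample:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

If the mark is stored after the second letter instead of between the two, it will attach one position too far to the right, which matches the symptom in the question.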

Unicode Keystroke Characters?

Does Unicode have characters similar to what the <kbd> tag produces in HTML? I want to use them in a game to indicate that the user can press a key to perform a certain action, for example:
Press R to reset, or S to open the settings menu.
Are there characters for that? I don't need anything fancy like ⇧ Shift or Tab ⇆, single-letter keys are plenty. I am looking for something that would work somewhat like the Enclosed Alphanumerics subrange.
If there are characters for that, where could I find a page describing them? All the Google searches I tried only turned up "unicode character keyboard shortcuts" stuff.
If there are not characters for that, how can I display something like that as part of (or at least in line with) a text string in Processing 2.0.1?
(The rendering referred to is not the default rendering of kbd, which simply shows the content in the system’s default monospace font. But e.g. in StackOverflow pages, a style sheet is used to format kbd so that it looks like a keycap.)
Somewhat surprisingly, there is a Unicode way to create something that looks like a character in a keycap: enter the character, then immediately COMBINING ENCLOSING KEYCAP U+20E3.
Font support for this character is very limited, but it does include a few free fonts. Unfortunately, none of them is a sans-serif font, and the character to be shown inside should normally appear in such a font; after all, real keycaps contain very simple shapes for characters, without serifs. And generally, a character and an enclosing mark should be taken from the same font; otherwise they might be incompatible. However, it seems that taking the normal character from the sans-serif font (FreeSans) in GNU FreeFont and the combining mark from the serif font (FreeSerif) of the same source creates a reasonable presentation:
I’m afraid it won’t work here in text, but I’ll try: A⃣ .
Whether this works depends on the use of suitable fonts, as mentioned, but also on the rendering software. Programs have been rather bad at displaying combining marks, but there has been some improvement. I tested this in Word 2007, where it works OK, and also on web browsers (Chrome, Firefox, IE) with good results using code like this:
<style>
.cap { font-family: FreeSerif; }
.cap span { font-family: FreeSans; }
</style>
<span class="cap"><span>A</span>⃣</span>
It isn’t perfect, when using the fonts mentioned. The character in the cap is not quite centered. Moreover, if I try to use the technique e.g. for the character Å (which is present on normal Nordic keyboards), the ring above A extends out of the cap. You could tweak this by setting the font size of the letter in the cap to, say, 85% of the font size of the combining mark, but then the horizontal position of the letter is even more off.
To summarize, it is possible to do such things at the character level, but if you can use other methods, like using a border or a background image for a character, you can probably achieve better rendering.
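At the character level, the keycap technique amounts to appending U+20E3 to each key letter. Here is a minimal sketch in Python; the keycap helper is a hypothetical name, and whether the result renders as a keycap still depends on the fonts and rendering software, as discussed above.

```python
import unicodedata

KEYCAP = "\u20e3"  # COMBINING ENCLOSING KEYCAP

def keycap(letter: str) -> str:
    """Return the letter followed by the enclosing keycap mark."""
    return letter + KEYCAP

message = f"Press {keycap('R')} to reset, or {keycap('S')} to open the settings menu."
print(message)
print(unicodedata.name(KEYCAP), unicodedata.category(KEYCAP))
```

The same string construction carries over to other languages, since it is only a matter of placing the combining mark immediately after the base character.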

How to display cross symbol as superscript with unicode?

I want to display a cross as a superscript. I know the Unicode character for the cross (\u2020).
Unicode encodes plain text. Superscripting isn’t plain text, so you need something external to, or “on top of”, plain text. For example, on a web page, you could use CSS to position the character above the baseline (and reduce the font size). In a word processor, you would use a superscripting command or style.
In Unicode, there is a limited number of superscript characters, i.e. variants of characters in superscript style encoded as separate characters, such as superscript two “²”. But Unicode has no mechanism for superscripts in general.
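The limitation can be illustrated with a small Python sketch. The to_superscript helper and its digit table are assumptions made for this example: digits have dedicated superscript characters, while the dagger (U+2020) has none and passes through unchanged.

```python
import unicodedata

# The superscript digits that Unicode encodes as separate characters.
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)

def to_superscript(text: str) -> str:
    """Replace digits with superscript variants; leave everything else alone."""
    return text.translate(SUPERSCRIPT_DIGITS)

print(to_superscript("x2"))
print(unicodedata.name("\u00b2"))

dagger = "\u2020"
print(to_superscript(dagger) == dagger)  # no superscript dagger exists to map to
```

For the dagger, the raised position therefore has to come from markup or styling rather than from the text itself.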

Convert non english characters into Unicode (UTF-8)

I copied a large amount of text from another system to my PC. When I viewed the text on my PC, it looked weird. So I copied all the fonts from the other PC and installed them on mine too. Now the text looks okay, but it seems that it is not actually in Unicode. For example, if I copy the text and paste it into another UTF-8-capable editor such as Notepad++, I get only English characters ("bgah;"), as shown below.
How can I convert this whole text into Unicode text like the one below, so that I can copy the text and paste it anywhere else?
பெயர்
The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil
I need this conversion to be done, so I can copy them into database tables.
'Ja-01' is a font with a custom 'visual encoding'.
That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.
This is always to be avoided, because by storing the content as "bgah;" you lose the ability to search and process it as real Tamil, but this approach was common in the pre-Unicode days especially for less-widespread scripts without mature encoding standards. This application probably predates widespread use of TSCII.
Because it is a custom encoding not shared by any other font, it is very unlikely you will be able to find a tool to convert content in this encoding to proper Unicode characters. It does not appear to be any standard character ordering, so you will have to look at the font (eg in charmap.exe) and note down every character, find the matching character in Unicode and map between them.
For example here's a trivial Python script to replace characters in a file:
mapping = {
    u'a': u'\u0BAF',  # Tamil letter Ya
    u'b': u'\u0BAA',  # Tamil letter Pa
    u'g': u'\u0BC6',  # Tamil vowel sign E (combining)
    u'h': u'\u0BB0',  # Tamil letter Ra
    u';': u'\u0BCD',  # Tamil sign virama (combining)
    # fill in the rest of the mapping information here!
}
with open('ja01data.txt', 'rb') as fp:
    data = fp.read().decode('utf-8')
for char in mapping:
    data = data.replace(char, mapping[char])
with open('utf8data.txt', 'wb') as fp:
    fp.write(data.encode('utf-8'))
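The same mapping can be tried on the sample string without touching any files. The sketch below is a variant, not the answer's script: it uses str.translate instead of repeated replace calls, contains only the five entries given above, and the convert function is a hypothetical name.

```python
# Only the five entries listed in the answer; a real table needs every glyph.
MAPPING = str.maketrans({
    "a": "\u0BAF",  # Tamil letter Ya
    "b": "\u0BAA",  # Tamil letter Pa
    "g": "\u0BC6",  # Tamil vowel sign E (combining)
    "h": "\u0BB0",  # Tamil letter Ra
    ";": "\u0BCD",  # Tamil sign virama (combining)
})

def convert(text: str) -> str:
    """Map font-specific Latin characters to Tamil code points in one pass."""
    return text.translate(MAPPING)

converted = convert("bgah;")
print(converted)
print([f"U+{ord(c):04X}" for c in converted])
```

A single-pass translate never re-replaces a character that has already been converted, which becomes relevant once the table grows and some replacement values could themselves be keys.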
The font you found is getting you into trouble. The actual cell text is "bgah;", it gets rendered to பெயர் because you found a font that can work with 8-bit non-Unicode characters. So reading it or pasting it into Notepad++ is going to produce "bgah;" since that's the real text. It can only ever be rendered properly again by forcing the program that displays the string to use that same font.
Ditch the font and enter Unicode so it looks like this:
"bgah" looks like a Baamini based system, which is pre-unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.
As the others mentioned, it looks like a custom visual encoding that mimics the appearance of a foreign script while keeping ASCII encoding underneath.
Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml
Let me know if this works. If not, I can ask around and get something for you.
You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check out whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode” but they say that only TSCII is supported for now.