remove bengali diacritics unicode php - unicode

Is there any way to print the Bengali vowel signs without the dotted circle? I found a link which says that concatenating an NBSP with the vowel sign should work. It does, but not for the vowel signs that are written before a consonant (e.g. ো ে ি). I could not attach an image since I am new to this site. If anyone wants a visual representation of my question please let me know your email address, I will send you an email. Thanks in advance.

It is true that you should use a no-break space (NBSP) before a combining mark to show it in (apparent) isolation; this is specified in clause 7.9 Combining marks in the Unicode Standard, chapter 7 (the name of the chapter is misleading, since it has general information, too, in addition to dealing with European scripts). However, it depends on the rendering software and the font used whether this has the desired effect.
In an HTML document, a combination of, say, NBSP and U+09C7 BENGALI VOWEL SIGN E is shown as blank in Chrome. This is an odd bug, of course. On IE and Firefox, you mostly get the rendering with a dotted circle, apparently because the browser does not want to apply the combining mark to a base character from a different font. If you use, say, a no-break space followed by ে as such with no styling, then browsers typically pick up the no-break space from Times New Roman and the Bengali character from another font, such as Vrinda. You can fix this by setting the font of the no-break space to the same as the Bengali character, e.g.
<p style="font-family: Vrinda"> ে
<p style="font-family: Sun-ExtA"> ে
<p style="font-family: Nirmala UI"> ে
<p style="font-family: FreeSerif"> ে
<p style="font-family: Code2000"> ে
<p style="font-family: Arial Unicode MS"> ে
<p style="font-family: ALPHA-Demo"> ে
In practice, you would use a font-family value that is a suitable list of fonts. Of course, this would not work on computers that have none of the fonts listed. And it won’t work in Chrome (or in Opera, which displays the symbol with a dotted circle).
The conclusion is that unless you are targeting a specific audience with known browsers and fonts, you should probably present the characters as images.
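As a rough illustration of the NBSP convention discussed in this answer, the following Python sketch prefixes a combining mark with U+00A0. The helper name `isolate_mark` is made up for this example. The code only builds the string; whether it renders without a dotted circle still depends on the browser and font, as noted above.

```python
import unicodedata

NBSP = "\u00A0"  # NO-BREAK SPACE

def isolate_mark(mark: str) -> str:
    """Return NBSP + mark for a single combining character (hypothetical helper)."""
    # General categories Mn, Mc and Me all start with "M".
    if len(mark) != 1 or not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a single combining mark")
    return NBSP + mark

# BENGALI VOWEL SIGN O, E and I: the signs mentioned in the question.
for sign in "\u09CB\u09C7\u09BF":
    print(unicodedata.name(sign), repr(isolate_mark(sign)))
```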

If I understand the question correctly, it is about writing Bengali letters, not related to PHP or the web in general, in any particular way. As the Bengali vowel signs combine with consonants, it seems that if you want to use independent vowel signs, you should use characters like U+0993 BENGALI LETTER O “ও”.
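If independent vowel letters are acceptable, the substitution can be sketched in Python by going through the Unicode character names. This is only one possible approach, and `independent_vowel` is a hypothetical helper name. It assumes that each "BENGALI VOWEL SIGN X" has a counterpart named "BENGALI LETTER X", which holds for the signs mentioned in the question.

```python
import unicodedata

def independent_vowel(sign: str) -> str:
    """Map a Bengali dependent vowel sign to the independent vowel letter."""
    name = unicodedata.name(sign)  # e.g. "BENGALI VOWEL SIGN O"
    if not name.startswith("BENGALI VOWEL SIGN "):
        raise ValueError(f"not a Bengali vowel sign: {name}")
    # "BENGALI VOWEL SIGN O" -> "BENGALI LETTER O"; lookup raises KeyError
    # if no letter with that name exists.
    return unicodedata.lookup(name.replace("VOWEL SIGN", "LETTER"))

for sign in "\u09CB\u09C7\u09BF":
    letter = independent_vowel(sign)
    print(f"U+{ord(sign):04X} -> U+{ord(letter):04X} {letter}")
```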

Related

Unusual rendering and copy-paste for the character 誤

I'm seeing somewhat unusual behavior around the rendering of 誤 in the browser (works across both Firefox and Chrome), which I'm having trouble explaining.
Specifically, check out the Wiktionary page for 誤:
Notice that there are 3 variations marked in black bold:
The top left one has 3 pieces: 言 + ⼝ + 天
The middle one kinda' has 4 pieces: 言 + ⼝ + a rotated ꒔ + ⼤
The bottom one has 3 pieces: ⻈ + ⼝ + 天
The relation between 2 and 3 is clear: 2 represents the traditional character and 3 represents the simplified character. But what does 1 represent? I've tried the following:
I tried copying character 1 but when I paste it, it ends up looking like character 2.
I tried various font combinations, both in the browser and in TextEdit, but the appearance and copy-pasting behavior persist.
So what is going on with this unusual character rendering and copy-pasting behavior? How can I reproduce character 1 (and not 2) in other applications?
FWIW, when I look at a Chinese dictionary, the stroke order shows character 2 even though the browser renders the character as 1.
This is a z-variant, and in this case probably an example of Han unification.
From https://www.zdic.net/hans/%E8%AA%A4:
You can see that the first character (marked as 内地 Mainland China) is what you're getting in the headword.
The headword on Wiktionary is formatted with lang=zh, whereas the example sentences use zh-Hans and zh-Hant respectively, and that's the core of this, along with likely subtags fallback.
Most systems dealing with locales perform locale fallback using likely subtags. Hans without any country specified typically implies CN, and Hant implies TW during fallback. The reverse is also true (and some other regions like HK imply Hant as well). Hans/Hant are script codes for Simplified and Traditional Chinese, and CN/TW are China and Taiwan respectively. zh on its own implies zh-CN (and thus zh-Hans-CN).
Fallback also need not always occur the same way: different fonts have different priorities (e.g. a Mainland Chinese font may assume CN by default unless explicitly told otherwise).
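To make the fallback idea concrete, here is a minimal Python sketch. The `LIKELY` table is a tiny hand-written excerpt that only encodes the implications stated in this answer; it is an assumption for illustration, not the real CLDR likely-subtags data, and `maximize` is a made-up helper name.

```python
# Hand-written excerpt, mirroring only the implications described above.
LIKELY = {
    "zh": "zh-Hans-CN",
    "zh-Hans": "zh-Hans-CN",
    "zh-Hant": "zh-Hant-TW",
    "zh-CN": "zh-Hans-CN",
    "zh-TW": "zh-Hant-TW",
    "zh-HK": "zh-Hant-HK",
}

def maximize(tag: str) -> str:
    """Fill in the script and region a bare tag implies; unknown tags pass through."""
    return LIKELY.get(tag, tag)

for tag in ("zh", "zh-Hans", "zh-Hant", "zh-TW", "zh-HK", "ja"):
    print(f"{tag:8} -> {maximize(tag)}")
```

A real implementation would consult the full CLDR data, and as noted, a font may still apply its own default.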
I made a little table (screenshot) showing the rendering of different language tags on my system when run on Wiktionary (the snippet is at the bottom of this post).
The font's actually defaulting to Noto Sans CJK JP unless I put it in a class=Hant context (where it switches to Noto Sans CJK TC).
What's happening under the hood is: traditional vs simplified is not unified in Unicode, but such variants are. Even though zh implies zh-Hans-CN, because this is a traditional character, the font will not use the Hans to pick a Simplified character: it must pick a traditional character since Simplified is encoded differently. So you get the Mainland Chinese traditional variant in zh contexts (like the headword), but since zh-Hant implies zh-TW, the font is happy to oblige and give you the Taiwanese (still traditional) variant in the example sentence.
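A short Python check illustrates the encoding side of this: the traditional character is a single code point no matter which regional variant a font draws, while the simplified character is a separate code point. The variable names are just for this sketch.

```python
from urllib.parse import quote

traditional = "\u8AA4"  # 誤
simplified = "\u8BEF"   # 误

# Both regional glyph variants of the traditional form share this one
# code point; only the simplified form is encoded separately.
print(f"traditional: U+{ord(traditional):04X}")
print(f"simplified:  U+{ord(simplified):04X}")

# Percent-encoded UTF-8, matching the last path segment of the zdic URL.
print(quote(traditional))
```

Since the variants share a code point, copying the text cannot carry the glyph choice with it, which is consistent with the copy-paste behavior in the question.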
Note that not all cases stick to a single font: sometimes the choice of language can force a different font to be selected (or the precise CSS used). Additionally, you can have z-variants crop up in different contexts without needing to change the language, for example the Cantonese possessive 嘅 can be built as ⿰口既 or ⿰口旣 and the choice is not clearly locale based and seems to vary freely between fonts.
Code for table above:
<table>
<tr lang=zh><td>zh</td><td>誤</td></tr>
<tr lang=zh-Hans><td>zh-Hans</td><td>誤</td></tr>
<tr lang=zh-Hant><td>zh-Hant</td><td>誤</td></tr>
<tr lang=zh-CN><td>zh-CN</td><td>誤</td></tr>
<tr lang=zh-Hant-CN><td>zh-Hant-CN</td><td>誤</td></tr>
<tr lang=zh-Hans-CN><td>zh-Hans-CN</td><td>誤</td></tr>
<tr lang=zh-TW><td>zh-TW</td><td>誤</td></tr>
<tr lang=zh-HK><td>zh-HK</td><td>誤</td></tr>
<tr lang=zh-Hans-TW><td>zh-Hans-TW</td><td>誤</td></tr>
<tr lang=ja><td>ja</td><td>誤</td></tr>
<tr lang=ko><td>ko</td><td>誤</td></tr>
<tr lang=vi><td>vi</td><td>誤</td></tr>
</table>
(Based on a Twitter discussion with manishearth)
The difference is coming up due to variations across fonts (called z-variants). Specifically, based on the language tag, the browser can pick different fonts within the same font family (e.g. sans-serif). For example, on my device:
With lang="zh", the browser picks PingFang SC from sans-serif.
With lang="zh-Hant", the browser picks PingFang TC from sans-serif.
These two fonts render the character differently. The lang tag is different in different parts of the HTML, causing different font selection and hence different rendering.
Outside the browser, depending on the language context, the variant/language can also change. There is more discussion of this with examples on the Han Unification Wikipedia page.

What range of unicode characters should be kept in a @font-face web font for a US based website with a US audience?

As part of optimizing a web development project, we need to strip out unnecessary characters that are never going to be used to reduce the size of font files. I have searched Google and found nothing canonical on the subject of which characters are required and which are safe to remove.
I've found the following ranges that may be of interest:
0020 — 007F Basic Latin
00A0 — 00FF Latin-1 Supplement
0100 — 017F Latin Extended-A
0180 — 024F Latin Extended-B
0250 — 02AF IPA Extensions
02B0 — 02FF Spacing Modifier Letters
0300 — 036F Combining Diacritical Marks
27C0 — 27EF Miscellaneous Mathematical Symbols-A
It seems that the most aggressive approach would be to only keep "Basic Latin", 0020 — 007F, which provides upper and lower-case letters, numbers and a few basic symbols, like the $, +, (, ), etc.
Latin-1 Supplement contains some extra goodies like the Copyright and Registered symbols and fractions.
Latin Extended-A and -B contain letters with accent marks, and since our copy is in English, I'm not sure if these will ever be needed.
If we use only those ranges (0020 — 007F) and (00A0 — 00FF), will we run into problems down the line with missing characters, should some user decide to post a comment in Spanish (for example)? Or will the browser fall back to a default font for characters that aren't included in the web font?
The point of a web-font is to make the main bodies of text and headlines look pretty, which the basic latin set should cover, but I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font, etc.
What range of unicode characters should be kept in a @font-face web font for a US based website with a US audience? Are there any best practices or guidelines for stripping unnecessary characters from a font for web use?
I would recommend subsetting to one of the common "code page" definitions that support US/Western Europe. Most code page definitions pre-date Unicode and typically have the bits and pieces needed for various regional support without including entire Unicode blocks. Suggestions:
Windows Code Page 1252
ISO/IEC 8859-1 "Latin 1"*
ISO/IEC 8859-15
*This is the same as Unicode Ranges 0020-007F Basic Latin + 00A0-00FF Latin-1 Supplement
These include much more than is strictly required for US English, though as noted above, several accented characters commonly appear in English text (é, ñ, as well as other punctuation marks and symbols). These sets include those characters, so you should be in good shape for the vast majority of text for a U.S. audience. Note also that in most fonts, these characters are typically "composites", which means that they use a reference to the components (e.g. 'é' is built from references to 'e' and '´'); as such, they don't normally require as much size to store them, so retaining them usually won't incur a major size penalty.
If you might encounter European financial text, I'd suggest either Windows 1252 or ISO/IEC 8859-15 which include the Euro currency symbol.
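One way to compare the suggested code pages is to ask Python's built-in codecs whether they can encode a given character. This is only a sketch; `encodable` is a hypothetical helper, and the codec names are Python's aliases for the three sets listed above.

```python
def encodable(text: str, codec: str) -> bool:
    """True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

# Python codec names for ISO/IEC 8859-1, Windows-1252 and ISO/IEC 8859-15.
CODECS = ("latin-1", "cp1252", "iso8859-15")

for ch in ("é", "ñ", "\u20AC"):  # the last one is the euro sign
    print(ch, {codec: encodable(ch, codec) for codec in CODECS})
```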
I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font
Any characters that don't exist in the font you are using will fall back to any default font the browser can find with the characters in. This will likely be ugly when interleaved with other characters from your custom font, but modern OSes provide decent font coverage for commonly-used characters from the above blocks so typically it will still be readable.
So you should include characters based on whether you think they'll be used commonly enough that having them rendered in an ugly font is a deal-breaker. For what it's worth, a pretty minimal set I have used before for a similar purpose is ¡£°±²³¿ÉËÑéëñ‘’“”–—•€™, but your site's exact requirements may vary. (For example, if you coöpted the New-Yorker-style diaeresis you would certainly want äëïöü.)
(How exactly default fallback fonts work varies between browsers and was famously troublesome in older versions of IE, and IE Mobile. But the basic accented Latin letters are pretty safe.)
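To see which characters of a given text would rely on fallback, a small Python sketch can test each one against the kept ranges. The ranges are the two blocks named in the question; `outside_ranges` is a made-up helper name, and the sample is the minimal set quoted above, with its punctuation written as escapes.

```python
# Basic Latin and Latin-1 Supplement, as listed in the question.
KEPT_RANGES = [(0x0020, 0x007F), (0x00A0, 0x00FF)]

def outside_ranges(text, ranges=KEPT_RANGES):
    """Return the characters of text that fall outside every kept range."""
    return [ch for ch in text
            if not any(lo <= ord(ch) <= hi for lo, hi in ranges)]

# Accented letters and symbols, then curly quotes, dashes, bullet,
# euro sign and trade mark sign.
sample = ("¡£°±²³¿ÉËÑéëñ"
          "\u2018\u2019\u201C\u201D\u2013\u2014\u2022\u20AC\u2122")

for ch in outside_ranges(sample):
    print(f"U+{ord(ch):04X} {ch}")
```

Anything the sketch prints would have to be added to the subset explicitly or left to the fallback font.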

Unicode Keystroke Characters?

Does unicode have characters in it similar to stuff like the things formed by the <kbd> tag in HTML? I want to use it as part of a game to indicate that the user can press a key to perform a certain action, for example:
Press R to reset, or S to open the settings menu.
Are there characters for that? I don't need anything fancy like ⇧ Shift or Tab ⇆, single-letter keys are plenty. I am looking for something that would work somewhat like the Enclosed Alphanumerics subrange.
If there are characters for that, where could I find a page describing them? All the Google searches I tried only turned up "unicode character keyboard shortcuts" stuff.
If there are not characters for that, how can I display something like that as part of (or at least in line with) a text string in Processing 2.0.1?
(The rendering referred to is not the default rendering of kbd, which simply shows the content in the system’s default monospace font. But e.g. in StackOverflow pages, a style sheet is used to format kbd so that it looks like a keycap.)
Somewhat surprisingly, there is a Unicode way to create something that looks like a character in a keycap: enter the character, then immediately COMBINING ENCLOSING KEYCAP U+20E3.
Font support for this character is very limited but includes a few free fonts. Unfortunately, none of them is a sans-serif font, and the character to be shown inside should normally appear in such a font – after all, real keycaps contain very simple shapes for characters, without serifs. And generally, a character and an enclosing mark should be taken from the same font; otherwise they might be incompatible. However, it seems that taking the normal character from the sans-serif font (FreeSans) in GNU Freefont and the combining mark from the serif font (FreeSerif) of the same source creates a reasonable presentation:
I’m afraid it won’t work here in text, but I’ll try: A⃣ .
Whether this works depends on the use of suitable fonts, as mentioned, but also on the rendering software. Programs have been rather bad at displaying combining marks, but there has been some improvement. I tested this in Word 2007, where it works OK, and also on web browsers (Chrome, Firefox, IE) with good results using code like this:
<style>
.cap { font-family: FreeSerif; }
.cap span { font-family: FreeSans; }
</style>
<span class="cap"><span>A</span>⃣</span>
It isn’t perfect, when using the fonts mentioned. The character in the cap is not quite centered. Moreover, if I try to use the technique e.g. for the character Å (which is present on normal Nordic keyboards), the ring above A extends out of the cap. You could tweak this by setting the font size of the letter in the cap to, say, 85% of the font size of the combining mark, but then the horizontal position of the letter is even more off.
To summarize, it is possible to do such things at the character level, but if you can use other methods, like using a border or a background image for a character, you can probably achieve better rendering.
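At the string level, the character-plus-U+20E3 technique from this answer is easy to sketch. The example below is in Python rather than Processing, and `keycap` is a hypothetical helper; it only builds the text, so the rendering caveats about fonts still apply.

```python
import unicodedata

KEYCAP = "\u20E3"  # COMBINING ENCLOSING KEYCAP

def keycap(ch: str) -> str:
    """Append the enclosing keycap mark to a single base character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return ch + KEYCAP

hint = f"Press {keycap('R')} to reset, or {keycap('S')} to open the settings menu."
print(hint)
print(unicodedata.name(KEYCAP), unicodedata.category(KEYCAP))
```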

is there a Unicode character for Copy and Paste?

Are there Unicode characters that represent Copy and Paste? Perhaps in Unicode 6?
(There are scissor symbols that can be used fittingly to represent Cut (e.g. ✂ U+2702), but I could never find one to represent Copy or Paste.)
How about this: &#x2398; = ⎘ which looks kind of like a copy from clipboard.
For Paste, the CLIPBOARD symbol (U+1F4CB) would be likely;
📋
My solution is to use two 📄 emojis and layer them over each other like so:
<span style="font-size: .875em; margin-right: .125em; position: relative; top: -.25em; left: -.125em">
📄<span style="position: absolute; top: .25em; left: .25em">📄</span>
</span>
(Of course I'm a web developer, so I have access to HTML. You might need to accomplish this another way. But I'm guessing a good chunk of people looking for an answer to this UI problem are using unicode in a website.)
The neat thing about this solution is if the thing you're copying is better represented by an icon other than 📄, you might be able to switch out the emoji for something else.
The scissor ✂️ and clipboard 📋 emojis are then suitable cut/paste companions.
I use scissors character for “cut out” on site — https://unicode-table.com/
To copy I use two squares but you could also use — Two Consecutive Equals Signs or Mahjong Tile Two of Bamboos.
To paste I use - Clipboard characters.
These characters correspond to characters in word.
Was looking as well, found these alternatives: ☍ ⊕ ⎘ ⩲ ⨧ ⑃ ended up using ⎘ like suggested above
There is an insertion symbol: U+2380 ⎀
Unicode encodes characters used in texts, not ideas or concepts. So unless there is a character commonly used in texts to symbolize cut and paste, you shouldn’t expect to find such a symbol in Unicode.
U+2702 BLACK SCISSORS is in Unicode since it appears in older character codes, and it has been used in printed documents to indicate a cutting line, as a “cut here” indicator, rather than as a symbol of copying.
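The formal names of the code points suggested in this thread can be checked with Python's `unicodedata` module, which makes it clear that none of them is defined as a "copy" or "paste" symbol. The `describe` helper is made up for this sketch.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME' using the Unicode character database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# Code points mentioned in the answers above.
CANDIDATES = ["\u2702", "\u2398", "\u2380", "\U0001F4CB", "\U0001F4C4"]

for ch in CANDIDATES:
    print(describe(ch))
```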
An (HTML-only) solution using emojis from the modern set*
Concept | HEX character code | Literal emoji
Copy = Camera | &#x1F4F7; | 📷
Cut = Scissors | &#x2702;&#xFE0F; | ✂️
Paste = Clipboard | &#x1F4CB; | 📋
*Emojis vary in appearance depending on device and font in use

What's the unicode glyph used to indicate combining characters?

My application needs to display "orphaned" combining characters. I would like to use the same format as the "official" unicode charts, using the dotted circle placeholder. See, for example:
Combining Diacritical Marks (PDF)
A quick scan through the charts and I came up with U+25CC "DOTTED CIRCLE". That looks good, but the note on this character reads:
note that the reference glyph for this
character is intentionally larger than
the dotted circle glyph used to
indicate combining characters in this
standard; see, for example, 0300
Which says (I think) that U+25CC is not the correct character. (Or, if it is, perhaps just a poorly worded note.)
So: if the dotted circle used on the "Combining Diacritical Marks" is not U+25CC, what is the correct code for that little booger?
I have tried:
Copying the text from the PDF and inspecting it, but the copy is disabled in the PDF.
Emailing it to myself in Gmail and then viewing the attachment as HTML, but there it gets converted to U+0024 ("DOLLAR SIGN"). Which means that either the conversion failed or they are just playing some font rendering games in the PDF.
[Clarification] I realize that the U+25CC looks OK (assuming one's font supports it), but it sounds like the spec says that this is the wrong character. Many unicode characters have similar glyphs but are different characters, semantically speaking. "Latin Capital Letter A" (U+0041) and "Greek Capital Letter Alpha" (U+0391) will look identical for most fonts, but they have different semantic meanings and are not interchangeable.
I don't think there is an official placeholder character. The way I read that note, they chose U+25CC arbitrarily, purely for display purposes. Then, in the chart where the "real" dotted circle is listed, they made it a little larger to emphasize that it's not being used as a placeholder there. (Or maybe they shrunk it in the other charts; as you said, the note's poorly worded.)
Whatever the case, I don't see any reason not to use U+25CC as your placeholder.
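If U+25CC is used as the placeholder, displaying an orphaned combining character could look roughly like this in Python. `with_placeholder` is a hypothetical helper, and the check on the general category (anything starting with "M") is an assumption about what counts as "combining" for the application.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def with_placeholder(ch: str) -> str:
    """Prefix combining marks with U+25CC; return other characters unchanged."""
    # Mn, Mc and Me are the three mark categories.
    if unicodedata.category(ch).startswith("M"):
        return DOTTED_CIRCLE + ch
    return ch

# COMBINING GRAVE ACCENT, ACUTE ACCENT, CIRCUMFLEX ACCENT and TILDE.
for mark in "\u0300\u0301\u0302\u0303":
    print(with_placeholder(mark), unicodedata.name(mark))
```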
Just tried this: create a blank .html file, copy the text, and load in Firefox. Displays as expected (although I really didn't expect space+combining character to display correctly):
<html>
<body>
<font size="24pt">
◌̀
◌́
◌̂
◌̃
<br/>
À
Á
Â
Ã
<br/>
̀
́
̂
̃
</font>
</body>
</html>