Best way to display "input too long" errors with unicode input?

Does anyone have good suggestions for displaying "max length exceeded" errors to a user when a single character doesn't equal one byte?
I'm at a loss for words, but I found a quote that's more eloquent:
If the buffer runs over by three bytes, what do you tell the user? Three bytes could be one, two, or three characters that the user needs to trim. Depending on which characters they trim, the result might still be too long. And recall that the user's perception of "a character" is probably closer to a grapheme or grapheme cluster than to a character. So they might delete too many characters without realizing it. Finally, if the buffer limit is small (like 10 or 20), some languages like Chinese will be severely restricted on the number of characters permitted.
A couple of constraints I'm under: it's a form-driven website, and the underlying database column sizes can't change (the page the quote comes from suggests having a 40-byte buffer and enforcing a 10-character limit).

My favorite way to solve this problem is to highlight the portion of the input that exceeds the maximum length. This provides a visual cue as to which part makes it "too long", without having to get into the specifics of how many bytes or characters it was.
If you can use JavaScript (for example, if you don't need to meet Section 508 standards), I also like monitoring the length of the field and alerting the user when it's too long (while still doing server-side validation, of course).
If you don't want to get into complex CSS inside the input field, you can just reproduce the bad input below the field and highlight it there.
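To highlight only the excess, you first need to know where the input stops fitting. Here is a minimal sketch, assuming the limit is counted in UTF-8 bytes; the helper name splitAtByteLimit is hypothetical. It walks the string by code point, so it never cuts a multi-byte character in half, but it can still split a grapheme cluster (a base letter plus its combining marks).

```javascript
// Hypothetical helper: split text into the prefix that fits in maxBytes of
// UTF-8 and the overflow that should be highlighted.
function splitAtByteLimit(text, maxBytes) {
  const encoder = new TextEncoder();
  let used = 0;
  let fits = "";
  for (const ch of text) {
    // for...of iterates by code point, not by UTF-16 code unit
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    fits += ch;
  }
  return { fits: fits, overflow: text.slice(fits.length) };
}

const result = splitAtByteLimit("日本語", 4);
console.log(result.fits, "|", result.overflow);
```

When echoing the input below the field, the overflow part is the piece to wrap in the highlighted element (after HTML-escaping both parts).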

The obvious correct answer is to not limit the text length.
But if you can't tell the user how many characters they have to play with, don't. Simply tell them when the string is too long. Keep track of how many bytes the current string would require, and if that is above your limit, enable a warning message for the user.
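Tracking the byte count can be sketched as follows, assuming the database column stores UTF-8 and that the runtime provides TextEncoder (current browsers and Node.js do). The function names are illustrative, not from any library.

```javascript
// Number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// True when the warning message should be shown.
function exceedsLimit(text, maxBytes) {
  return utf8ByteLength(text) > maxBytes;
}

console.log(utf8ByteLength("abc")); // 3
console.log(utf8ByteLength("誤")); // 3
console.log(exceedsLimit("誤誤誤誤", 10)); // true (12 bytes)
```

The same check has to be repeated on the server, since client-side validation can be bypassed.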

Just thinking out loud... why not be less specific: "maximum length exceeded by N" (e.g. "maximum length exceeded by 4")? You don't tell the user what the max length is, just that they are N over it. And you don't tell the user what N represents (bytes). When they see the message "too long by 3", they will remove at least 3 characters (even though they could be removing 9 actual bytes).
I figure there's just no way to explain to users why certain "characters" require multiple bytes that doesn't have a high probability of confusing them.

Good question. Not sure there's a good answer, other than changing the schema to use Unicode characters instead of bytes. For example, in SQL Server with NVARCHAR, or in MySQL with a UTF-8 character set, columns are limited by character length. That's pushing it a bit regarding “column lengths can't change” of course, even if they're technically the same ‘length’.
For what it's worth, East Asian users will be used to the idea that a character isn't a byte, because there's a long-standing tradition of ‘half-width’ Latin characters taking up half as much storage and screen space as the Chinese ideographs.
You can't really generally expect anyone to grok UTF-8 byte numbers though. Perhaps at the client side you could do it purely visually, using an ‘amount used’ bar instead of a number of bytes:
<style type="text/css">
.field { width: 12em; }
.field input { width: 100%; }
.field input { box-sizing: border-box; -moz-box-sizing: border-box; -ms-box-sizing: border-box; -webkit-box-sizing: border-box; -khtml-box-sizing: border-box; }
.indicator { background: blue; height: 5px; }
.indicator-over { background: red; height: 5px; }
</style>
<div class="field">
<input type="text" name="pwd" class="limited-12">
</div>
<script type="text/javascript">
// Attach an 'amount used' bar to a text input; limit is the maximum size in UTF-8 bytes.
function limitInput(element, limit) {
    var indicator = document.createElement('div');
    element.parentNode.insertBefore(indicator, element.nextSibling);
    element.onchange = element.onkeyup = function() {
        // unescape(encodeURIComponent(s)) yields one character per UTF-8 byte
        var utf8 = unescape(encodeURIComponent(element.value));
        indicator.className = utf8.length > limit ? 'indicator-over' : 'indicator';
        var used = Math.min(utf8.length / limit, 1);
        indicator.style.width = Math.floor(used * 100) + '%';
    };
    element.onchange();
}
// Apply to every input whose class is "limited-N", where N is the byte limit.
var inputs = document.getElementsByTagName('input');
for (var i = inputs.length; i-- > 0;)
    if (inputs[i].className.substring(0, 8) == 'limited-')
        limitInput(inputs[i], parseInt(inputs[i].className.substring(8), 10));
</script>

Unusual rendering and copy-paste for the character 誤

I'm seeing somewhat unusual behavior around the rendering of 誤 in the browser (it happens in both Firefox and Chrome), which I'm having trouble explaining.
Specifically, check out the Wiktionary page for 誤:
Notice that there are 3 variations marked in black bold:
1. The top left one has 3 pieces: 言 + ⼝ + 天
2. The middle one kinda' has 4 pieces: 言 + ⼝ + a rotated ꒔ + ⼤
3. The bottom one has 3 pieces: ⻈ + ⼝ + 天
The relation between 2 and 3 is clear: 2 represents the traditional character and 3 represents the simplified character. But what does 1 represent? I've tried the following:
I tried copying character 1 but when I paste it, it ends up looking like character 2.
I tried various font combinations, both in the browser and in TextEdit, but the appearance and copy-pasting behavior persist.
So what is going on with this unusual character rendering and copy-pasting behavior? How can I reproduce character 1 (and not 2) in other applications?
FWIW, when I look at a Chinese dictionary, the stroke order shows character 2 even though the browser renders the character as 1.

This is a z-variant, and in this case probably an example of Han unification.
From https://www.zdic.net/hans/%E8%AA%A4:
You can see that the first character (marked as 内地 Mainland China) is what you're getting in the headword.
The headword on Wiktionary is formatted with lang=zh, whereas the example sentences use zh-Hans and zh-Hant respectively. That is the core of this, along with likely-subtags fallback.
Most systems dealing with locales perform locale fallback using likely subtags. Hans without any country specified typically implies CN, and Hant implies TW during fallback. The reverse is also true (and some other regions, like HK, imply Hant as well). Hans and Hant are the script codes for Simplified and Traditional Chinese, and CN and TW are China and Taiwan respectively. zh on its own implies zh-CN (and thus zh-Hans-CN).
Fallback also need not always occur the same way: different fonts have different priorities (e.g. a Mainland Chinese font may assume CN by default unless explicitly told otherwise).
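The likely-subtags expansion described above can be inspected directly in JavaScript with Intl.Locale and its maximize() method. This is only a sketch of the locale side of the story: the results come from the CLDR data shipped with the runtime, and fonts or browsers may apply their own fallback on top of it. The wrapper name likelyLocale is made up for this example.

```javascript
// Expand a language tag to its most likely full form (language-script-region).
function likelyLocale(tag) {
  return new Intl.Locale(tag).maximize().toString();
}

for (const tag of ["zh", "zh-Hans", "zh-Hant", "zh-TW", "zh-HK"]) {
  console.log(tag, "->", likelyLocale(tag));
}
```

With current CLDR data, zh expands to zh-Hans-CN, while zh-Hant and zh-TW both expand to zh-Hant-TW.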
I made a little table, screenshot showing the rendering of different language tags on my system when run on Wikipedia (snippet at the bottom of this post)
The font's actually defaulting to Noto Sans CJK JP unless I put it in a class=Hant context (where it switches to Noto Sans CJK TC).
What's happening under the hood is: traditional vs simplified is not unified in Unicode, but such variants are. Even though zh implies zh-Hans-CN, because this is a traditional character, the font will not use the Hans to pick a Simplified character: it must pick a traditional character since Simplified is encoded differently. So you get the Mainland Chinese traditional variant in zh contexts (like the headword), but since zh-Hant implies zh-TW, the font is happy to oblige and give you the Taiwanese (still traditional) variant in the example sentence.
Note that not all cases stick to a single font: sometimes the choice of language can force a different font to be selected (or the precise CSS used). Additionally, you can have z-variants crop up in different contexts without needing to change the language, for example the Cantonese possessive 嘅 can be built as ⿰口既 or ⿰口旣 and the choice is not clearly locale based and seems to vary freely between fonts.
Code for table above:
<table>
<tr lang=zh><td>zh</td><td>誤</td></tr>
<tr lang=zh-Hans><td>zh-Hans</td><td>誤</td></tr>
<tr lang=zh-Hant><td>zh-Hant</td><td>誤</td></tr>
<tr lang=zh-CN><td>zh-CN</td><td>誤</td></tr>
<tr lang=zh-Hant-CN><td>zh-Hant-CN</td><td>誤</td></tr>
<tr lang=zh-Hans-CN><td>zh-Hans-CN</td><td>誤</td></tr>
<tr lang=zh-TW><td>zh-TW</td><td>誤</td></tr>
<tr lang=zh-HK><td>zh-HK</td><td>誤</td></tr>
<tr lang=zh-Hans-TW><td>zh-Hans-TW</td><td>誤</td></tr>
<tr lang=ja><td>ja</td><td>誤</td></tr>
<tr lang=ko><td>ko</td><td>誤</td></tr>
<tr lang=vi><td>vi</td><td>誤</td></tr>
</table>

(Based on a Twitter discussion with manishearth)
The difference is coming up due to variations across fonts (called z-variants). Specifically, based on the language tag, the browser can pick different fonts within the same font family (e.g. sans-serif). For example, on my device:
With lang="zh", the browser picks PingFang SC from sans-serif.
With lang="zh-Hant", the browser picks PingFang TC from sans-serif.
These two fonts render the character differently. The lang tag is different in different parts of the HTML, causing different font selection and hence different rendering.
Outside the browser, depending on the language context, the variant/language can also change. There is more discussion of this with examples on the Han Unification Wikipedia page.

How do I convert three digit hexadecimal String to Color in flutter

I want to convert the hexadecimal string "#0ff" into a Color.
I have searched, but I only found answers for hexadecimal strings that are 6 digits long.
If I write Color(0xFF0ff), it displays nothing.
Thanks in advance!

The web defines colours in three ways: explicit colour name, 3 hex digits, or 6 hex digits.
If you have 3 digits, you should "double" every digit (as a character, not as a value), so #abc is to be read as #aabbcc. Why? It is short, and it helped to select fewer colours (in the old days we had 8- or 16-bit colours), and screens may not be so accurate.
So your #0ff should be read as #00ffff.
Note: These 3-digit colours are now considered obsolete, see https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#simple-colour, but they are still used in many places (e.g. #fff for white).

The "accepted" answer by GC is incorrect in so many ways, in fact the only thing he got right is that #0ff = #00ffff.
First, the current standard for color for CSS and web content is here:
https://www.w3.org/TR/css-color-3/#numerical
#0ff is a perfectly legal and not obsolete color definition.
The permissible forms for defining sRGB color for the web are:
* { color: #f00; } /* #rgb */
* { color: #ff0000; } /* #rrggbb */
* { color: rgb(255,0,0); } /* each value is 0-255 */
* { color: rgb(100%, 0%, 0%); } /* each value is 0%-100% */
* { color: hsl(0, 100%, 50%); } /* red using hsl "hue sat lightness" */
* { color: red; } /* color keyword (unquoted) */
Not shown above: transparency or alpha syntax.
String Theory
For the first two in the list, the color value is parsed as a STRING data type and not a numeric value.
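If I write Color(0xFF0ff), it displays nothing.
Color(0xFF0ff) is not a string at all: 0xFF0ff is a numeric literal, equal to 0x000FF0FF. Flutter's Color constructor reads a 32-bit value as AARRGGBB, so the alpha byte here is 0x00 and the color is fully transparent, which is why nothing shows. In the same way, 0x123456 is a numeric value in hex, equal to the integer 1193046; to write it as a string that is valid for a CSS-type color definition, you need to write it as #123456.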
Three's Company, Six is the Clone. Or Sith. Or Something...
In the "three digit" format, each hex value is duplicated (cloned LOL), so #abc becomes #aabbcc.
For a math definition, 0xCC = 0xC * 0x10 + 0xC so if we have the hex numeric 0xABC:
0xAABBCC = 0xA * 0x110000 + 0xB * 0x1100 + 0xC * 0x11
So, if you are working with numeric data types, then use x = 0x123456, but if you are sending strings to some other API, then x = "#123456". Also remember that if you need to accept #123456, it is going to be a string unless you convert it to a number, which is easy in JS with parseInt(x, base) once the leading # is stripped, where the base is of course 16.
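Putting the digit doubling and the string-to-number conversion together, here is a minimal JavaScript sketch. The helper name hexToArgb is hypothetical, and it assumes the target wants a fully opaque 32-bit AARRGGBB value, which is the layout Flutter's Color constructor reads.

```javascript
// Hypothetical helper: turn "#rgb" or "#rrggbb" into a 32-bit AARRGGBB number
// with full opacity. The name and error message are illustrative only.
function hexToArgb(hex) {
  let digits = hex.replace(/^#/, "");
  if (/^[0-9a-fA-F]{3}$/.test(digits)) {
    // Double each digit as a character: "0ff" -> "00ffff".
    digits = [...digits].map((d) => d + d).join("");
  }
  if (!/^[0-9a-fA-F]{6}$/.test(digits)) {
    throw new Error("expected #rgb or #rrggbb, got: " + hex);
  }
  // Add the opaque alpha byte arithmetically; a bitwise OR would produce a
  // negative signed 32-bit number in JavaScript.
  return 0xff000000 + parseInt(digits, 16);
}

console.log(hexToArgb("#0ff").toString(16)); // prints "ff00ffff"
```

The same steps carry over to Dart: strip the #, double the digits if there are three, parse as base 16, and add the alpha byte before passing the value to Color.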
Side note to comment on a different answer:
GC said:
Why? It is short, and it helped to select fewer colours (in the old days we had 8- or 16-bit colours), and screens may not be so accurate.
None of this is relevant. #rgb and #rrggbb have been in the CSS standard since the very beginning in the mid 90s, and #rgb has ALWAYS expanded to #rrggbb, regardless of system capability, and in the mid 90s 8bit per channel (aka 24bit color) was the norm as it still is today.
Moreover, color is a part of design and presentation, and design should be abstracted to CSS, and not in the HTML which is for content structure.
The fact that the HTML spec wants to "simplify" color declarations has more to do with efficiency in parsing the DOM than with anything you mentioned. Colors should not be in HTML at all; they should be in the CSS. I assure you #rgb is not "obsolete": it is very much a part of the latest CSS color standard, approved a month ago and linked at the beginning of this post.
The "actual why" is three digit color is easier to use
#fea is a lot easier to remember than #fcecb0 and yet they are nearly identical in appearance.
The full #rrggbb is useful when you need to exactly match an 8 bit sRGB color such as in an image. Otherwise, there is no issue with using #rgb instead of #rrggbb, and there are plenty of reasons to do so, none of which have anything to do with bit depth, nor for that matter "screen accuracy" which is still up to user adjustment and calibration now as it was then.

Unicode Keystroke Characters?

Does unicode have characters in it similar to stuff like the things formed by the <kbd> tag in HTML? I want to use it as part of a game to indicate that the user can press a key to perform a certain action, for example:
Press R to reset, or S to open the settings menu.
Are there characters for that? I don't need anything fancy like ⇧ Shift or Tab ⇆, single-letter keys are plenty. I am looking for something that would work somewhat like the Enclosed Alphanumerics subrange.
If there are characters for that, where could I find a page describing them? All the Google searches I tried only turned up "unicode character keyboard shortcuts" stuff.
If there are not characters for that, how can I display something like that as part of (or at least in line with) a text string in Processing 2.0.1?

(The rendering referred to is not the default rendering of kbd, which simply shows the content in the system’s default monospace font. But e.g. in StackOverflow pages, a style sheet is used to format kbd so that it looks like a keycap.)
Somewhat surprisingly, there is a Unicode way to create something that looks like a character in a keycap: enter the character, then immediately COMBINING ENCLOSING KEYCAP U+20E3.
Font support for this character is very limited, but it does include a few free fonts. Unfortunately, none of them is a sans-serif font, and the character to be shown inside should normally appear in such a font; after all, real keycaps contain very simple shapes for characters, without serifs. And generally, a character and an enclosing mark should be taken from the same font; otherwise they might be incompatible. However, it seems that taking the normal character from the sans-serif font (FreeSans) in GNU FreeFont and the combining mark from the serif font (FreeSerif) of the same source creates a reasonable presentation:
I’m afraid it won’t work here in text, but I’ll try: A⃣ .
Whether this works depends on the use of suitable fonts, as mentioned, but also on the rendering software. Programs have been rather bad at displaying combining marks, but there has been some improvement. I tested this in Word 2007, where it works OK, and also on web browsers (Chrome, Firefox, IE) with good results using code like this:
<style>
.cap { font-family: FreeSerif; }
.cap span { font-family: FreeSans; }
</style>
<span class="cap"><span>A</span>⃣</span>
It isn’t perfect, when using the fonts mentioned. The character in the cap is not quite centered. Moreover, if I try to use the technique e.g. for the character Å (which is present on normal Nordic keyboards), the ring above A extends out of the cap. You could tweak this by setting the font size of the letter in the cap to, say, 85% of the font size of the combining mark, but then the horizontal position of the letter is even more off.
To summarize, it is possible to do such things at the character level, but if you can use other methods, like using a border or a background image for a character, you can probably achieve better rendering.
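At the string level, the character-based approach described above amounts to appending U+20E3 to the key's letter. Here is a minimal sketch in JavaScript; the function name keycap is made up for this example, and whether the result renders as a keycap still depends entirely on the font and the rendering software.

```javascript
// U+20E3 COMBINING ENCLOSING KEYCAP, placed immediately after the base character.
const COMBINING_ENCLOSING_KEYCAP = "\u20E3";

// Hypothetical helper: build the keycap sequence for a single character.
function keycap(ch) {
  if ([...ch].length !== 1) {
    throw new Error("expected exactly one character, got: " + ch);
  }
  return ch + COMBINING_ENCLOSING_KEYCAP;
}

console.log("Press " + keycap("R") + " to reset, or " + keycap("S") + " to open the settings menu.");
```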

is there a Unicode character for Copy and Paste?

Are there Unicode characters that represent Copy and Paste? Perhaps in Unicode 6?
(There are scissor symbols that can fittingly represent Cut (e.g. ✂ U+2702), but I could never find one to represent Copy or Paste.)

How about this: &#x2398; = ⎘, which looks kind of like a copy from clipboard.

For Paste, the CLIPBOARD symbol (U+1F4CB) would be likely;
📋

My solution is to use two 📄 emojis and layer them over each other like so:
<span style="font-size: .875em; margin-right: .125em; position: relative; top: -.25em; left: -.125em">
📄<span style="position: absolute; top: .25em; left: .25em">📄</span>
</span>
(Of course I'm a web developer, so I have access to HTML. You might need to accomplish this another way. But I'm guessing a good chunk of people looking for an answer to this UI problem are using unicode in a website.)
The neat thing about this solution is if the thing you're copying is better represented by an icon other than 📄, you might be able to switch out the emoji for something else.
The scissor ✂️ and clipboard 📋 emojis are then suitable cut/paste companions.

I use the scissors character for “cut out”, from the site https://unicode-table.com/
To copy I use two squares, but you could also use Two Consecutive Equals Signs or Mahjong Tile Two of Bamboos.
To paste I use the Clipboard character.
These characters correspond to characters in Word.

Was looking as well and found these alternatives: ☍ ⊕ ⎘ ⩲ ⨧ ⑃. I ended up using ⎘, as suggested above.

There is an insertion symbol: U+2380 ⎀

Unicode encodes characters used in texts, not ideas or concepts. So unless there is a character commonly used in texts to symbolize cut and paste, you shouldn’t expect to find such a symbol in Unicode.
U+2702 BLACK SCISSORS is in Unicode since it appears in older character codes, and it has been used in printed documents to indicate a cutting line, as a “cut here” indicator, rather than as a symbol of copying.

An (HTML-only) solution using emojis from the modern set*
Concept | HEX character code | Literal emoji
Copy = Camera | &#x1F4F7; | 📷
Cut = Scissors | &#x2702;&#xFE0F; | ✂️
Paste = Clipboard | &#x1F4CB; | 📋
* Emojis vary in appearance depending on device and font in use

What's the unicode glyph used to indicate combining characters?

My application needs to display "orphaned" combining characters. I would like to use the same format as the "official" unicode charts, using the dotted circle placeholder. See, for example:
Combining Diacritical Marks (PDF)
A quick scan through the charts and I came up with U+25CC "DOTTED CIRCLE". That looks good, but the note on this character reads:
note that the reference glyph for this character is intentionally larger than the dotted circle glyph used to indicate combining characters in this standard; see, for example, 0300
Which says (I think) that U+25CC is not the correct character. (Or, if it is, perhaps just a poorly worded note.)
So: if the dotted circle used on the "Combining Diacritical Marks" is not U+25CC, what is the correct code for that little booger?
I have tried:
Copying the text from the PDF and inspecting it, but the copy is disabled in the PDF.
Emailing it to myself in Gmail and then viewing the attachment as HTML, but there it gets converted to U+0024 ("DOLLAR SIGN"). Which means that either the conversion failed or they are just playing some font rendering games in the PDF.
[Clarification] I realize that U+25CC looks OK (assuming one's font supports it), but it sounds like the spec says that this is the wrong character. Many Unicode characters have similar glyphs but are different characters, semantically speaking. "Latin Capital Letter A" (U+0041) and "Greek Capital Letter Alpha" (U+0391) will look identical in most fonts, but they have different semantic meanings and are not interchangeable.

I don't think there is an official placeholder character. The way I read that note, they chose U+25CC arbitrarily, purely for display purposes. Then, in the chart where the "real" dotted circle is listed, they made it a little larger to emphasize that it's not being used as a placeholder there. (Or maybe they shrunk it in the other charts; as you said, the note's poorly worded.)
Whatever the case, I don't see any reason not to use U+25CC as your placeholder.
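Here is a minimal sketch of using U+25CC as the placeholder, in JavaScript. It assumes that "orphaned" means the text starts with a combining mark (Unicode general category M); the function name is hypothetical.

```javascript
// U+25CC DOTTED CIRCLE, used here as the base for an orphaned combining mark.
const DOTTED_CIRCLE = "\u25CC";

// Hypothetical helper: prefix the dotted circle when the text begins with a
// combining mark (general category M), otherwise return the text unchanged.
function showOrphanedMark(text) {
  return /^\p{M}/u.test(text) ? DOTTED_CIRCLE + text : text;
}

console.log(showOrphanedMark("\u0300")); // dotted circle carrying a grave accent
console.log(showOrphanedMark("A\u0300")); // unchanged: the mark already has a base
```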

Just tried this: create a blank .html file, copy the text, and load in Firefox. Displays as expected (although I really didn't expect space+combining character to display correctly):
<html>
<body>
<font size="24pt">
◌̀
◌́
◌̂
◌̃
<br/>
À
Á
Â
Ã
<br/>
̀
́
̂
̃
</font>
</body>
</html>
