BOCU-1 for internal encoding of strings [closed] - unicode

Some languages/platforms, such as Java, JavaScript, Windows, Dotnet, and KDE, use UTF-16. Others prefer UTF-8.
What is the reason that no language/platform uses BOCU-1? What is the rationale for JEP 254 and for the JEP 254 equivalent for Dotnet?
Is the reason that BOCU-1 is patented? Are there any technical reasons also?
Edit
My question is not about Java specifically. By JEP 254, I mean the compact representation of UTF-16 strings described in that proposal. My question is: since BOCU-1 is compact for almost any Unicode string, why doesn't any language/platform use it internally instead of UTF-16 or UTF-8? Such usage would improve cache performance for any string, not just ASCII or Latin-1 strings.
Such usage might also help with support for non-Latin programming languages in formats like the Language Server Index Format (LSIF).
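To make the compactness claim concrete, here is a small Java sketch (my own illustration, assuming a UTF-8-encoded source file; BOCU-1 itself is not shipped with the JDK, though ICU provides a converter for it) comparing the payload size of a few strings in UTF-8 and UTF-16:

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String[] samples = {
            "hello world",      // ASCII only
            "grüß dich",        // Latin-1 accents
            "こんにちは世界",     // Japanese (BMP, non-Latin)
            "नमस्ते दुनिया"       // Devanagari
        };
        for (String s : samples) {
            // UTF_16LE is used so the count excludes the 2-byte BOM that UTF_16 would add.
            System.out.printf("UTF-8: %2d bytes, UTF-16: %2d bytes for \"%s\"%n",
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16LE).length,
                    s);
        }
    }
}
```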

What is the reason that no language/platform uses BOCU-1?
That question is far too broad in scope for Stack Overflow, and a concise answer is impossible.
However, in the specific case of Java, note that someone raised the possibility of Java adopting BOCU-1 as an RFE (Request For Enhancement) in 2002. See JDK-4787935 (str) Reducing the memory footprint for Strings.
That bug was closed with a resolution of "Won't Fix" ten years later:
"Although this is a very interesting proposal, it is highly unlikely that BOCU or any other multi-byte encoding for internal use would be adopted. Furthermore, this comes down to a space-time tradeoff with unclear long-term consequences. Given the length of time this proposal has lingered, it seems appropriate to close it as will not fix".
What is the rationale for JEP 254...?
There is a section of JEP 254 titled "Motivation" which explains that, and in particular it states "most String objects contain only Latin-1 characters". However, if that does not satisfy you, raise a separate question.
Ensure that it is on topic for Stack Overflow by reviewing What topics can I ask about here? first. Two of the people who reviewed JEP 254 (Aleksey Shipilev and Brian Goetz) respond here on SO, so you may get an authoritative answer.
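For intuition about that "Latin-1" point, here is a rough sketch (illustrative only, not the actual JEP 254 implementation, which lives inside java.lang.String): if every char of a string fits in Latin-1 it can be stored as one byte per character, otherwise it falls back to two bytes per char.

```java
import java.nio.charset.StandardCharsets;

public class CompactSketch {
    // Illustrative only: the real JEP 254 logic is inside java.lang.String.
    static byte[] compactIfPossible(String s) {
        byte[] latin1 = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                // At least one character doesn't fit in Latin-1:
                // fall back to two bytes per char (UTF-16).
                return s.getBytes(StandardCharsets.UTF_16LE);
            }
            latin1[i] = (byte) c;
        }
        return latin1; // one byte per character, like a "compact" string
    }

    public static void main(String[] args) {
        System.out.println(compactIfPossible("hello").length);  // 5
        System.out.println(compactIfPossible("héllo").length);  // 5 (é is Latin-1)
        System.out.println(compactIfPossible("hεllo").length);  // 10 (ε forces UTF-16)
    }
}
```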
What is the rationale for ... JEP 254 equivalent for Dotnet?
Again, raise this as a separate SO question.
Is the reason that BOCU-1 is patented?
That question is specifically off topic here: "Legal questions, including questions about copyright or licensing, are off-topic for Stack Overflow", though Wikipedia notes "BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with intellectual property restrictions".
Are there any technical reasons also?
A very important non-technical reason is that the HTML5 specification explicitly forbids the use of BOCU-1:
Avoid these encodings
The HTML5 specification calls out a number of encodings that you should avoid...
Documents must also not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they... were never intended for Web content and the HTML5 specification forbids browsers from recognising them.
Of course that invites the question of why HTML5 forbids the use of BOCU-1, and the only technical reason I can find is that this Mozilla documentation on HTML's <meta> element states:
Authors must not use CESU-8, UTF-7, BOCU-1 and/or SCSU as cross-site scripting attacks with these encodings have been demonstrated.
See this GitHub link for more details on the XSS vulnerability with BOCU-1.
Also note that, in line with the HTML5 specification, none of the major browsers support BOCU-1.

Related

Why are there multiple versions of Unicode? Why isn't everything UTF-8? [closed]

Again and again, I keep asking myself: Why do they always insist on over-complicating everything?!
I've tried to read up about and understand Unicode many times over the years. When they start talking about endians and BOMs and all that stuff, my eyes just "zone out". I physically cannot keep reading and retain what I'm seeing. I fundamentally don't get their desire for over-complicating everything.
Why do we need UTF-16 and UTF-32 and "big endian" and "little endian" and BOMs and all this nonsense? Why wasn't Unicode just defined as "compatible with ASCII, but you can also use multiple bytes to represent all these further characters"? That would've been nice and simple, but nooo... let's have all this other stuff so that Microsoft chose UTF-16 for Windows NT and nothing is easy or straightforward!
As always, there probably is a reason, but I doubt it's good enough to justify all this confusion and all these problems arising from insisting on making it so complex and difficult to grasp.
Unicode started out as a 16-bit character set, so naturally every character was simply encoded as two consecutive bytes. However, it quickly became clear that this would not suffice, so the limit was increased. The problem was that some programming languages and operating systems had already started implementing Unicode as 16-bit and they couldn’t just throw out everything they had already built, so a new encoding was devised that stayed backwards-compatible with these 16-bit implementations while still allowing full Unicode support. This is UTF-16.
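To see that backwards compatibility in practice, here is a small Java illustration (Java strings are sequences of UTF-16 code units): a character from the original 16-bit range occupies one code unit, while a supplementary character such as U+1F600 is stored as a surrogate pair of two code units.

```java
public class Utf16Demo {
    public static void main(String[] args) {
        String bmp = "A";                // U+0041, inside the original 16-bit range
        String emoji = "\uD83D\uDE00";   // U+1F600, stored as a surrogate pair
        System.out.println(bmp.length());                              // 1 code unit
        System.out.println(emoji.length());                            // 2 code units
        System.out.println(emoji.codePointCount(0, emoji.length()));   // but only 1 code point
    }
}
```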
UTF-32 represents every character as a sequence of four bytes, which is utterly impractical and virtually never used to actually store text. However, it is very useful when implementing algorithms that operate on individual codepoints – such as the various mechanisms defined by the Unicode standard itself – because all codepoints are always the same length and iterating over them becomes trivial, so you will sometimes find it used internally for buffers and such.
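In Java, for example, you can get that fixed-width, per-codepoint view without storing UTF-32 at all by iterating code points (each int is effectively a UTF-32 value); a minimal sketch:

```java
public class CodePoints {
    public static void main(String[] args) {
        String s = "a\uD834\uDD1Ez"; // 'a', U+1D11E MUSICAL SYMBOL G CLEF, 'z'
        // Each element of the stream is one full code point (an int, i.e. a UTF-32 value).
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // Prints U+0061, U+1D11E, U+007A
    }
}
```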
UTF-8 meanwhile is what you actually want to use to store and transmit text. It is compatible with ASCII and self-synchronising (unlike the other two) and it is quite space-efficient (unlike UTF-32). It will also never produce eight binary zeroes in a row (unless you are trying to represent the literal NULL character) so UTF-8 can safely be used in legacy environments where strings are null-terminated.
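A quick Java check of two of those properties (ASCII compatibility and the absence of stray zero bytes); the printed byte values are signed, which is just how Java displays them:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Props {
    public static void main(String[] args) {
        // ASCII text encodes to exactly the same bytes in UTF-8.
        System.out.println(Arrays.toString("abc".getBytes(StandardCharsets.UTF_8)));
        // prints [97, 98, 99]

        // Non-ASCII characters become multi-byte sequences, and none of those bytes is zero.
        System.out.println(Arrays.toString("\u00E9\u20AC".getBytes(StandardCharsets.UTF_8)));
        // é (U+00E9) -> C3 A9, € (U+20AC) -> E2 82 AC (shown as signed byte values)
    }
}
```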
Endianness is an intrinsic property of any encoding whose code units are larger than one byte: computers simply don't always agree on what order to read a sequence of bytes in. For Unicode, this problem can be circumvented by including a Byte Order Mark (U+FEFF) at the start of the text stream, because if you read its bytes in the wrong order in UTF-16 or UTF-32 you get U+FFFE, a noncharacter that is guaranteed never to occur in real text, so you know that that particular byte order cannot be the right one.
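The BOM is simply the code point U+FEFF written at the start of the stream; a small Java sketch showing how its bytes differ between the two UTF-16 byte orders:

```java
import java.nio.charset.StandardCharsets;

public class BomBytes {
    public static void main(String[] args) {
        byte[] be = "\uFEFF".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "\uFEFF".getBytes(StandardCharsets.UTF_16LE);
        System.out.printf("UTF-16BE BOM: %02X %02X%n", be[0] & 0xFF, be[1] & 0xFF); // FE FF
        System.out.printf("UTF-16LE BOM: %02X %02X%n", le[0] & 0xFF, le[1] & 0xFF); // FF FE
    }
}
```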

Would adding a new currency symbol today be harder or easier than the addition of the euro? [closed]

From a programming perspective, when the euro symbol was created, extended ASCII code pages had to accommodate it, as did Unicode and other code-page standards. Fonts and printer languages had to add the new glyph. Accounting and reporting software had to deal with a new formatting symbol. Exchange and money markets had to deal with the addition of the XEU and subsequently the EUR.
If a major new currency (of the same magnitude as the euro) was "created" today, similar changes would be required.
Could such a new currency attain the same elevation as the euro? For example, is there room in the extended ASCII tables, and would a major new currency symbol be considered worthy of occupying a spot?
Would it be easier to add a new currency symbol today than it was at the euro's birth, or did the humble € enter the IT world during a sweet spot?
By extended ASCII, I'm assuming you're referring to its inclusion in Latin-9? Introduction of a new ISO/IEC 8859 character set to accommodate new currency symbols seems highly unlikely these days; Unicode is far more prevalent than it was when Latin-9 was introduced. In terms of consideration and speed of inclusion into Unicode, that would depend on a fair number of factors, so it's hard to say.
The most recent proposed currency symbol inclusion that I'm aware of is for bitcoin, which seems likely to go ahead. But there will be a fair amount of time between the proposal and widespread font support. Interestingly, the lack of support for the symbol has resulted in a number of alternatives and approximations. New currency symbols may result in similar workarounds -- it's hard to say, though.
Given how ubiquitous Unicode is now, and in particular the usage of UTF-8 on the web, I'd say new currency symbols have a pretty good chance of relatively quick inclusion, but really it's font support that will trail behind. The more important/prevalent the symbol, the more likely it is for font support to gain traction.
If a major new currency (of the same magnitude as the euro) was "created" today, similar changes would be required.
Unicode is deliberately designed with lots of space free for new symbols (eg U+20BE Lari Sign in Unicode 8.0). It takes a while for fonts to appear with glyphs for them, and for text processing systems to be able to categorise the characters correctly. But even without any special support, simple-layout characters like currency signs can be displayed (eg using embedded fonts) without any code changes to existing systems.
The Euro was planned at a time when Unicode support was still sporadic and incomplete enough that it was considered necessary to make the symbol available in Western 8-bit encodings (eg Windows code pages 1252–1258). This was a change to the meaning of existing code pages, which was a big load of hassle.
But Unicode is well-supported enough now that no one would bother to go back and change legacy encodings, or add new ones, just for a new currency symbol (even something as big as the Euro).
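To illustrate the difference in practice, here is a small Java check (illustrative only, and it assumes your JDK ships the windows-1252 charset, which mainstream JDKs do): the euro sign got a slot in the legacy Windows-1252 code page, whereas a newer symbol like the lari sign (U+20BE) exists only in Unicode encodings, yet needs no code changes to be stored as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CurrencySigns {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(cp1252.newEncoder().canEncode('\u20AC')); // true: € was given slot 0x80
        System.out.println(cp1252.newEncoder().canEncode('\u20BE')); // false: ₾ has no legacy slot
        // The new symbol still "just works" in Unicode encodings, with no code changes.
        System.out.println("\u20BE".getBytes(StandardCharsets.UTF_8).length); // 3 bytes
    }
}
```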

What is Unicode, and how does encoding work? [closed]

A few hours ago I was reading a C programming book. While reading it I came across the terms character encoding and Unicode, so I started googling for information about Unicode. I learned that the Unicode character set contains characters from every language, and that UTF-8, UTF-16 and UTF-32 can encode the characters listed in the Unicode character set.
But I was not able to understand how it works.
Does Unicode depend upon the operating system?
How is it related to software and programs?
Is UTF-8 a piece of software that was installed on my computer along with the operating system?
Or is it related to hardware?
And how does a computer encode things?
I find it very confusing. Please answer in detail.
I am new to these things, so please keep that in mind when you answer.
Thank you.
I have written about this extensively in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. Here are some highlights:
encodings are plentiful; an encoding defines how a "character" like "A" can be encoded as bits and bytes
most encodings only specify this for a small number of selected characters, for example all (or at least most) characters needed to write English or Czech; single-byte encodings typically support a set of up to 256 characters
Unicode is one large standardisation effort which has catalogued and specified a number ⟷ character relationship for virtually all characters and symbols of every major language in use, which amounts to hundreds of thousands of characters
UTF-8, UTF-16 and UTF-32 are different sub-standards for how to encode this ginormous catalog of numbers into bytes, each with different size trade-offs
software needs to specifically support Unicode and its UTF-* encodings, just as it needs to support any other kind of specialized encoding; most of the work is done by the OS these days, which exposes supporting functions to an application
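As a concrete illustration of those last two points, a short Java example using the JDK's built-in charset support (the sample text is arbitrary; any non-ASCII string works):

```java
import java.nio.charset.StandardCharsets;

public class EncodeDecode {
    public static void main(String[] args) {
        String text = "žluťoučký kůň"; // any non-ASCII sample will do
        // Encoding: characters -> bytes, according to one specific encoding.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        // Decoding: bytes -> characters; the same encoding must be used.
        String roundTrip = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(text));               // true
        System.out.println(utf8.length + " vs " + utf16.length);  // different sizes, same text
    }
}
```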

How are non-English programming/scripting languages developed? [closed]

How are non-English programming/scripting languages developed?
Do you need to be a computer scientist?
You need to understand how Unicode works to build a parser for a language written in a non-English script, and yes, you either need to be a CS major or have the ability to teach yourself compiler design.
Study Unicode -- learn to use ICU -- or a language with GOOD Unicode support.
Decide on and build a VM (or use an existing one).
Write a lexer/parser, or use something like ANTLR (Java-based); see the lexing sketch after this list.
Decide on an AST.
Generate the instruction stream for the VM.
Check out "Principles of Compiler Design".
You use a character set capable of encoding all of Unicode, such as UTF-8. Characters above the ASCII range are encoded as multi-byte sequences in UTF-8, as one or two 16-bit code units in UTF-16, and as single 32-bit code units in UTF-32. With 16- and 32-bit code units the question of byte order (endianness) arises, because different machines store the bytes of a multi-byte unit in different orders; note that this is unrelated to bidi (bidirectional text), which is handled at a much higher level. The byte order is stated either by prefixing the stream with a byte order mark (BOM) or by using a variant of the encoding with a fixed byte order: UTF-16BE mandates big-endian byte order, and UTF-16LE mandates little-endian byte order.
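A hedged sketch of how a reader might use the BOM to pick the right byte order (illustrative only; real decoders such as those in ICU or the JDK do this for you):

```java
public class BomSniff {
    // Returns the charset suggested by a leading byte order mark, or null if there is none.
    static String sniff(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                             && (head[2] & 0xFF) == 0xBF) return "UTF-8";
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) return "UTF-16BE";
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) return "UTF-16LE";
        return null; // no BOM: fall back to a declared or default encoding
    }

    public static void main(String[] args) {
        // FF FE followed by 'A' encoded little-endian (41 00).
        System.out.println(sniff(new byte[] {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00})); // UTF-16LE
    }
}
```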
There is also the UCS, the Universal Character Set defined in ISO/IEC 10646. The term is still used, but its older fixed-width forms (such as UCS-2) are effectively deprecated because they cannot represent the characters whose encoding takes more than one 16-bit unit. For information about the differences between UCS and Unicode please read this: http://en.wikipedia.org/wiki/Universal_Character_Set#Differences_between_ISO_10646_and_Unicode
Some examples are the following:
IRI - RFC 3987 - http://www.ietf.org/rfc/rfc3987.txt - mandates UTF-8 encoding
Mail Markup Language - http://mailmarkup.org/ - mandates UTF-16BE encoding

Need a list of languages that are supported completely by ASCII encoding [closed]

I am writing an article on Unicode and discussing the advantages of this encoding scheme over outdated methods like ASCII.
As part of my research I am looking for a reference that listed the languages that could be fully represented using only the characters supported by ASCII. Haven't had much luck tracking it down with Google and I thought I'd tap the collective knowledge of SO to see if anyone had a reasonable list.
Key points:
All languages listed must be able to be completely represented using the character set available in ASCII.
I know this won't be comprehensive, but I am mostly interested in the most common written languages.
There are no natural languages that I know of that can be fully represented in ASCII. Even American English, the language for which ASCII was invented, doesn't work: for one, there are a lot of foreign words that have been integrated into the American English language that cannot be represented in ASCII, like resumé, naïve or a word that probably every programmer uses regularly, schönfinkeln.
And two, ASCII is missing pretty much all typographic characters like “quotation marks”, dashes of various lengths (– and —), ellipses (…), thin and wide spaces and so on, all of which are used in American English.
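If you want to test candidate texts empirically while writing the article, a trivial Java check of whether a string is representable in 7-bit ASCII (using the JDK's charset encoder; the sample words are just illustrations):

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
    public static void main(String[] args) {
        CharsetEncoder ascii = StandardCharsets.US_ASCII.newEncoder();
        System.out.println(ascii.canEncode("resume"));              // true
        System.out.println(ascii.canEncode("r\u00E9sum\u00E9"));    // false: é is not in ASCII
        System.out.println(ascii.canEncode("\u201Cquotes\u201D"));  // false: typographic quotation marks
    }
}
```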
IIRC from my Latin classes, the macrons in Latin are later additions by people studying meters in Latin poetry; they wouldn't have been used in every-day writing. So you've got Latin.
Given loan words, I don't think there are any such languages. Even ugly Americans know the difference between "resume" and "résumé".
I assume you mean natural languages and only 7-bit ASCII?
In that case the list is quite small. Mostly English.
Some constructed languages such as Interlingua and Ido are designed to use only ASCII characters. ‘Real’ languages in everyday use tend to use characters outside the ASCII range, at the very least for loanwords.
Not a widely used language, but Rotokas can be written using only ASCII letters. See http://en.wikipedia.org/wiki/Rotokas_alphabet