What's the 254 symbol in ascii, is it ■ or þ? - unicode

Looking at this page https://www.asciitable.com I see that the 254 symbol is a filled square (■). But here: https://www.ascii-code.com it says it's the small letter thorn (þ).
When using the symbol in code as 0xFE on a Mac using the Swift language I get the thorn letter (þ), then how do I get the square (■) in ascii?
Using this tool https://www.rapidtables.com/convert/number/ascii-to-hex.html and (■) as input I get a number larger than 255, so it can't be represented in ascii then?

Neither.
ASCII determines characters for codes 0-127. There is no default "Extended ASCII". Meaning of all other codes depends on actual encoding. In your second link the used encoding is CP-1252 as it clearly states there. In your third link you get Unicode code for BLACK SQUARE as a fallback, there is no ASCII code for the black square anyway.

Related

Subscripted 'y' in unicode

I have to display $CₓH\subscript{y}$.
Is there any chance to display a subscripted 'y' in Unicode?
\u2093 represents the subscripted 'x'
Usually you do this with formatting. Unicode's selection of superscript and subscript characters doesn't stem from the need or desire to cover whole alphabets but rather to enable specific use cases, e.g. writing IPA. Furthermore, if you're using a good OpenType font it can also support proper subscripts for arbitrary characters at the font level (where a glyph isn't simply scaled down by the layout engine, but rather a specifically-designed subscript glyph from the font is used).
In fact, since you're already using TeX or something vaguely similar to it, just let one of the many implementations render it. There are lots of things you simply cannot do in plain text without formatting, and this is one of them.
The subscript and superscript characters in Unicode do not cover the whole alphabet.
See the Wiki article on this topic or this answer on SO.
In Sublime Text this subscripted y works: ᵧ. Copied from here: https://lingojam.com/SubscriptGenerator
EDIT This is actually the greek letter gamma

Why is unicode written like U+0000? [duplicate]

Why do Unicode code points appear as U+<codepoint>?
For example, U+2202 represents the character ∂.
Why not U- (dash or hyphen character) or anything else?
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990's software industry discussions about Unicode, is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came in to use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990's should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols U+HHHH and U-HHHHHHHH (p. 559).
The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:
u'xyz' to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx' to indicate a string with a unicode character denoted by four hex digits
'\Uxxxxxxxx' to indicate a string with a unicode character denoted by eight hex digits
It depends on what version of the Unicode standard you are talking about. From Wikipedia:
Older versions of the standard used
similar notations, but with slightly
different rules. For example, Unicode
3.0 used "U-" followed by eight digits, and allowed "U+" to be used
only with exactly four digits to
indicate a code unit, not a code
point.
It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)

What's the difference between ASCII and Unicode?

What's the exact difference between Unicode and ASCII?
ASCII has a total of 128 characters (256 in the extended set).
Is there any size specification for Unicode characters?
ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 221 characters, which, similarly, map to numbers 0–221 (though not all numbers are currently assigned, and some are reserved).
Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".
Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.
Understanding why ASCII and Unicode were created in the first place helped me understand the differences between the two.
ASCII, Origins
As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*. Which means that we can represent 128 characters maximum.
Wait, 7 bits? But why not 1 byte (8 bits)?
The last bit (8th) is used for avoiding errors as parity bit.
This was relevant years ago.
Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.
See below the binary representation of a few characters in ASCII:
0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)
See the full ASCII table over here.
ASCII was meant for English only.
What? Why English only? So many languages out there!
Because the center of the computer industry was in the USA at that
time. As a consequence, they didn't need to support accents or other
marks such as á, ü, ç, ñ, etc. (aka diacritics).
ASCII Extended
Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support "é", in French, for example). Just using one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256 characters). And not 2^7 as before (128).
10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)
The name for this "ASCII extended to 8 bits and not 7 bits as before" could be just referred as "extended ASCII" or "8-bit ASCII".
As #Tom pointed out in his comment below there is no such thing as "extended ASCII" yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example, the ISO 8859-1, also called ISO Latin-1.
Unicode, The Rise
ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the likes?
We would have needed an entirely new character set... that's the rational behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic amount of characters (see this table).
You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.
Encodings: UTF-8 vs UTF-16 vs UTF-32
This answer does a pretty good job at explaining the basics:
UTF-8 and UTF-16 are variable length encodings.
In UTF-8, a character may occupy a minimum of 8 bits.
In UTF-16, a character length starts with 16 bits.
UTF-32 is a fixed length encoding of 32 bits.
UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.
Mnemonics:
UTF-8: minimum 8 bits.
UTF-16: minimum 16 bits.
UTF-32: minimum and maximum 32 bits.
Note:
Why 2^7?
This is obvious for some, but just in case. We have seven slots available filled with either 0 or 1 (Binary Code).
Each can have two combinations. If we have seven spots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.
Source: Wikipedia, this great blog post and Mocki.co where I initially posted this summary.
ASCII has 128 code points, 0 through 127. It can fit in a single 8-bit byte, the values 128 through 255 tended to be used for other characters. With incompatible choices, causing the code page disaster. Text encoded in one code page cannot be read correctly by a program that assumes or guessed at another code page.
Unicode came about to solve this disaster. Version 1 started out with 65536 code points, commonly encoded in 16 bits. Later extended in version 2 to 1.1 million code points. The current version is 6.3, using 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.
Encoding in 16-bits was common when v2 came around, used by Microsoft and Apple operating systems for example. And language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16-bits. An encoding called UTF-16, a variable length encoding where one code point can take either 2 or 4 bytes. The original v1 code points take 2 bytes, added ones take 4.
Another variable length encoding that's very common, used in *nix operating systems and tools is UTF-8, a code point can take between 1 and 4 bytes, the original ASCII codes take 1 byte the rest take more. The only non-variable length encoding is UTF-32, takes 4 bytes for a code point. Not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, widely ignored.
An issue with the UTF-16/32 encodings is that the order of the bytes will depend on the endian-ness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.
Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers which UTF choice is "best". Their association with operating system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special codepoint (U+FEFF, zero width space) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianess and is neutral to a text rendering engine. Unfortunately it is optional and many programmers claim their right to omit it so accidents are still pretty common.
java provides support for Unicode i.e it supports all world wide alphabets. Hence the size of char in java is 2 bytes. And range is 0 to 65535.
ASCII has 128 code positions, allocated to graphic characters and control characters (control codes).
Unicode has 1,114,112 code positions. About 100,000 of them have currently been allocated to characters, and many code points have been made permanently noncharacters (i.e. not used to encode any character ever), and most code points are not yet assigned.
The only things that ASCII and Unicode have in common are: 1) They are character codes. 2) The 128 first code positions of Unicode have been defined to have the same meanings as in ASCII, except that the code positions of ASCII control characters are just defined as denoting control characters, with names corresponding to their ASCII names, but their meanings are not defined in Unicode.
Sometimes, however, Unicode is characterized (even in the Unicode standard!) as “wide ASCII”. This is a slogan that mainly tries to convey the idea that Unicode is meant to be a universal character code the same way as ASCII once was (though the character repertoire of ASCII was hopelessly insufficient for universal use), as opposite to using different codes in different systems and applications and for different languages.
Unicode as such defines only the “logical size” of characters: Each character has a code number in a specific range. These code numbers can be presented using different transfer encodings, and internally, in memory, Unicode characters are usually represented using one or two 16-bit quantities per character, depending on character range, sometimes using one 32-bit quantity per character.
ASCII and Unicode are two character encodings. Basically, they are standards on how to represent difference characters in binary so that they can be written, stored, transmitted, and read in digital media. The main difference between the two is in the way they encode the character and the number of bits that they use for each. ASCII originally used seven bits to encode each character. This was later increased to eight with Extended ASCII to address the apparent inadequacy of the original. In contrast, Unicode uses a variable bit encoding program where you can choose between 32, 16, and 8-bit encodings. Using more bits lets you use more characters at the expense of larger files while fewer bits give you a limited choice but you save a lot of space. Using fewer bits (i.e. UTF-8 or ASCII) would probably be best if you are encoding a large document in English.
One of the main reasons why Unicode was the problem arose from the many non-standard extended ASCII programs. Unless you are using the prevalent page, which is used by Microsoft and most other software companies, then you are likely to encounter problems with your characters appearing as boxes. Unicode virtually eliminates this problem as all the character code points were standardized.
Another major advantage of Unicode is that at its maximum it can accommodate a huge number of characters. Because of this, Unicode currently contains most written languages and still has room for even more. This includes typical left-to-right scripts like English and even right-to-left scripts like Arabic. Chinese, Japanese, and the many other variants are also represented within Unicode. So Unicode won’t be replaced anytime soon.
In order to maintain compatibility with the older ASCII, which was already in widespread use at the time, Unicode was designed in such a way that the first eight bits matched that of the most popular ASCII page. So if you open an ASCII encoded file with Unicode, you still get the correct characters encoded in the file. This facilitated the adoption of Unicode as it lessened the impact of adopting a new encoding standard for those who were already using ASCII.
Summary:
1.ASCII uses an 8-bit encoding while Unicode uses a variable bit encoding.
2.Unicode is standardized while ASCII isn’t.
3.Unicode represents most written languages in the world while ASCII does not.
4.ASCII has its equivalent within Unicode.
Taken From: http://www.differencebetween.net/technology/software-technology/difference-between-unicode-and-ascii/#ixzz4zEjnxPhs
Storage
Given numbers are only for storing 1 character
ASCII ⟶ 27 bits (1 byte)
Extended ASCII ⟶ 28 bits (1 byte)
UTF-8 ⟶ minimum 28, maximum 232 bits (min 1, max 4 bytes)
UTF-16 ⟶ minimum 216, maximum 232 bits (min 2, max 4 bytes)
UTF-32 ⟶ 232 bits (4 bytes)
Usage (as of Feb 2020)
ASCII defines 128 characters, as Unicode contains a repertoire of more than 120,000 characters.
Beyond how UTF is a superset of ASCII, another good difference to know between ASCII and UTF is in terms of disk file encoding and data representation and storage in random memory. Programs know that given data should be understood as an ASCII or UTF string either by detecting special byte order mark codes at the start of the data, or by assuming from programmer intent that the data is text and then checking it for patterns that indicate it is in one text encoding or another.
Using the conventional prefix notation of 0x for hexadecimal data, basic good reference is that ASCII text starts with byte values 0x00 to 0x7F representing one of the possible ASCII character values. UTF text is normally indicated by starting with the bytes 0xEF 0xBB 0xBF for UTF8. For UTF16, start bytes 0xFE 0xFF, or 0xFF 0xFE are used, with the endian-ness order of the text bytes indicated by the order of the start bytes. The simple presence of byte values that are not in the ASCII range of possible byte values also indicates that data is probably UTF.
There are other byte order marks that use different codes to indicate data should be interpreted as text encoded in a certain encoding standard.

Why does the unicode Superscripts and Subscripts block not contain simple sequences of all letters?

The arrangement of the characters that can be used as super-/subscript letters seems completely chaotic. Most of them are obviously not meant to be used as sup/subscr. letters, but even those which are do not hint a very reasonable ordering. In Unicode 6.0 there is now at last an alphabetically-ordered subset of the subscript letters h-t in U+2095 through U+209C, but this was obviously rather squeezed into the remaining space in the block and encompasses less than 1/3 of all letters.
Why did the consortium not just allocate enough space for at least one sup and one subscript alphabet in lower case?
The disorganization in the arrangement of these characters is because they were encoded piecemeal as scripts that used them were encoded, and as round-trip compatibility with other character sets was added. Chapter 15 of the Unicode Standard has some discussion of their origins: for example superscript digits 1 to 3 were in ISO Latin-1 while the others were encoded to support the MARC-8 bibliographic character set (see table here); and U+2071 SUPERSCRIPT LATIN SMALL LETTER I and U+207F SUPERSCRIPT LATIN SMALL LETTER N were encoded to support the Uralic Phonetic Alphabet.
The Unicode Consortium have a general policy of not encoding characters unless there's some evidence that people are using the characters to make semantic distinctions that require encoding. So characters won't be encoded just to complete the set, or to make things look neat.

Why is 'U+' used to designate a Unicode code point?

Why do Unicode code points appear as U+<codepoint>?
For example, U+2202 represents the character ∂.
Why not U- (dash or hyphen character) or anything else?
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990's software industry discussions about Unicode, is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came in to use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990's should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols U+HHHH and U-HHHHHHHH (p. 559).
The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:
u'xyz' to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx' to indicate a string with a unicode character denoted by four hex digits
'\Uxxxxxxxx' to indicate a string with a unicode character denoted by eight hex digits
It depends on what version of the Unicode standard you are talking about. From Wikipedia:
Older versions of the standard used
similar notations, but with slightly
different rules. For example, Unicode
3.0 used "U-" followed by eight digits, and allowed "U+" to be used
only with exactly four digits to
indicate a code unit, not a code
point.
It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)