Why is a SHA-1 Hash 40 characters long if it is only 160 bit? - encoding

The title of the question says it all. I have been researching SHA-1 and most places I see it being 40 hex characters long, which to me is 640 bit. Could it not be represented just as well with only 10 hex characters: 160 bit = 20 byte? And one hex character can represent 2 bytes, right? Why is it twice as long as it needs to be? What am I missing in my understanding?
And couldn't a SHA-1 be even just 5 or fewer characters if using Base32 or Base36?

One hex character can only represent 16 different values, i.e. 4 bits. (16 = 2^4)
40 × 4 = 160.
And no, you need much more than 5 characters in base-36.
There are 2^160 different SHA-1 hashes in total.
2^160 = 16^40, so this is another reason why we need 40 hex digits.
But 2^160 = 36^(160 · log36 2) = 36^30.9482..., so you still need 31 characters using base-36.
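To make that concrete, here's a small C sketch (the function name is mine; link with -lm) that computes ceil(160 / log2(b)), i.e. how many base-b digits are needed for 160 bits:

#include <math.h>
#include <stdio.h>

/* Number of base-b digits needed to represent an n-bit value: ceil(n / log2(b)). */
static int digits_needed(int bits, int base) {
    return (int)ceil(bits / log2(base));
}

int main(void) {
    int bases[] = {16, 32, 36, 64};
    for (int i = 0; i < 4; i++)
        printf("base %2d: %d digits for 160 bits\n",
               bases[i], digits_needed(160, bases[i]));
    return 0;   /* prints 40, 32, 31 and 27 */
}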

I think the OP's confusion comes from the fact that a string representing a SHA-1 hash takes 40 bytes (at least if you are using ASCII), which equals 320 bits (not 640 bits).
The reason is that the hash is in binary and the hex string is just an encoding of that. So if you were to use a more efficient encoding (or no encoding at all), you could take only 160 bits of space (20 bytes), but the problem with that is it won't be binary safe.
You could use base64 though, in which case you'd need about 27-28 bytes (or characters) instead of 40 (see this page).

There are two hex characters per 8-bit-byte, not two bytes per hex character.
If you are working with 8-bit bytes (as in the SHA-1 definition), then a hex character encodes a single high or low 4-bit nibble within a byte. So it takes two such characters for a full byte.

My answer only differs from the previous ones in my theory as to the EXACT origin of the OP's confusion, and in the baby steps I provide for elucidation.
A character takes up different numbers of bytes depending on the encoding used (see here). There are a few contexts these days when we use 2 bytes per character, for example when programming in Java (here's why). Thus 40 Java characters would equal 80 bytes = 640 bits, the OP's calculation, and 10 Java characters would indeed encapsulate the right amount of information for a SHA-1 hash.
Unlike the thousands of possible Java characters, however, there are only 16 different hex characters, namely 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F. But these are not the same as Java characters, and take up far less space than the encodings of the Java characters 0 to 9 and A to F. They are symbols signifying all the possible values represented by just 4 bits:
0 0000 4 0100 8 1000 C 1100
1 0001 5 0101 9 1001 D 1101
2 0010 6 0110 A 1010 E 1110
3 0011 7 0111 B 1011 F 1111
Thus each hex character is only half a byte, and 40 hex characters gives us 20 bytes = 160 bits - the length of a SHA-1 hash.

2 hex characters make up a range from 0-255, i.e. 0x00 == 0 and 0xFF == 255. So 2 hex characters are 8 bits, which makes 160 bits for your SHA digest.

SHA-1 is 160 bits
That translates to 20 bytes = 40 hex characters (2 hex characters per byte)
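To see the two-hex-characters-per-byte relationship in code, here's a minimal C sketch; no hashing library is involved, the digest bytes are simply hard-coded (they happen to be the SHA-1 of the empty string):

#include <stdio.h>

int main(void) {
    /* A 20-byte (160-bit) digest, hard-coded for illustration. */
    unsigned char digest[20] = {
        0xDA, 0x39, 0xA3, 0xEE, 0x5E, 0x6B, 0x4B, 0x0D, 0x32, 0x55,
        0xBF, 0xEF, 0x95, 0x60, 0x18, 0x90, 0xAF, 0xD8, 0x07, 0x09
    };
    char hex[41];                       /* 2 hex chars per byte, plus NUL */

    for (int i = 0; i < 20; i++)
        sprintf(&hex[i * 2], "%02x", digest[i]);   /* one byte -> two nibbles */

    printf("%s\n", hex);                /* prints a 40-character string */
    return 0;
}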

Related

Variable byte encoding - Information Retrieval

Question:
What is the largest gap that can be encoded in 2 bytes using the variable-
byte encoding ?
Answer:
With 2 bytes, we use 2 continuation bits, and 14 bits are available for gap encoding (2^0 to 2^13). Hence, the largest gap that can be encoded is 2^14 − 1 = 16383 (when all 14 bits are set to 1).
I need to do the same question as above but for 3 bytes. Below is my answer but I am not sure if it is correct. Could somebody let me know if I am doing it correctly? thanks
Question:
What is the largest gap that can be encoded in 3 bytes using the variable-
byte encoding ?
My Answer:
With 3 bytes, we use 3 continuation bits, and 21 bits are available for gap encoding (2^0 to 2^20). Hence, the largest gap that can be encoded is 2^21 − 1 = 2097151 (when all 21 bits are set to 1).
Yes, that is correct: with 3 bytes you have 3 continuation bits and 21 bits available for gap encoding, so the largest gap that can be encoded is (2^21) - 1 = 2,097,151 (when all 21 bits are set to 1).
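If it helps, here's a tiny C sketch of that arithmetic (the helper name is mine), assuming the usual variable-byte layout of 1 continuation bit plus 7 payload bits per byte:

#include <stdio.h>

/* Each byte in variable-byte encoding carries 1 continuation bit and
 * 7 payload bits, so n bytes leave 7*n bits for the gap value. */
static unsigned long max_gap(int nbytes) {
    return (1UL << (7 * nbytes)) - 1;
}

int main(void) {
    for (int n = 1; n <= 4; n++)
        printf("%d byte(s): largest encodable gap = %lu\n", n, max_gap(n));
    return 0;   /* 127, 16383, 2097151, 268435455 */
}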

Bits, Bytes and numbers. Shrink the size of the byte

It may be a very basic low level architecture questions. I am trying to get my head around it. Please correct if my understanding is wrong, as well.
Word = 64 bits, 32 bits, etc. This is the number of bits a computer can read at a time.
Questions:
1.) Would this mean we can send 4 numbers (of 8 bits / 1 byte each) in 32 bits? Or a combination of 8-bit (byte), 32-bit (4-byte), etc. numbers at one time?
2.) If we need to send only an 8-bit number, then how does it form a word? Is only the first byte filled and the rest padded with 0s, or is the last byte filled while the rest are padded with 0s? Or, as I saw somewhere, does the first byte carry information about how the rest of the bytes are filled? For example, in UTF-8, ASCII is 1 byte and some other chars take up to 4 bytes. So when we send one char, do we send all 4 bytes together, filling the bytes as required for the char and leaving the rest as 0s?
3.) Now, to represent an 8-digit number, we would need 27 bits (remember the famous question: sorting 1 million 8-digit numbers with just 1 MB of RAM). Can we use exactly 27 bits, which is 32 bits (4 bytes) minus 5 bits, and use those 5 bits for something else?
Appreciate your answers!
1- Yes, four 8-bit integers can fit in a 32-bit integer. This can be done using bitwise operations, for example (using C operators):
((a & 255) << 24) | ((b & 255) << 16) | ((c & 255) << 8) | (d & 255)
This example uses C operators, but they are also used for the same purpose in several other languages (see below for a complete, compilable version of this example in C). You may want to look up the bitwise operators AND (&), OR (|), and left shift (<<).
2- Unused bits are generally 0. The first byte is sometimes used to represent the type of encoding (Look up "Magic Numbers"), but this is implementation dependent. Sometimes it is a different number of bits.
3- Groups of 8-digit numbers can be compressed to use only 27 bits each. This is very similar to the example, except the number of bits and size of the data are different. To do this, you will need 864-bit groups, i.e. 27 32-bit integers to store 32 27-bit numbers. This would be more complex than the example, but it would use the same principles.
Complete, compilable example in C:
#include <stdio.h>
#include <stdint.h>

/* Compresses four integers, each carrying one byte of data in its least
 * significant byte, into a single 32-bit integer. */
uint32_t compress(unsigned a, unsigned b, unsigned c, unsigned d) {
    uint32_t compressed = ((a & 255u) << 24) | ((b & 255u) << 16) |
                          ((c & 255u) << 8)  |  (d & 255u);
    return compressed;
}

/* Test the compress() function and print the results. */
int main(void) {
    printf("%x\n", (unsigned)compress(255, 0, 255, 0));
    printf("%x\n", (unsigned)compress(192, 168, 0, 255));
    printf("%x\n", (unsigned)compress(84, 94, 255, 2));
    return 0;
}
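Going the other way, unpacking, is just a shift and a mask per byte. Here is a minimal sketch (the helper name is mine) that pulls individual bytes back out of a packed 32-bit word:

#include <stdio.h>
#include <stdint.h>

/* Extract the i-th byte (0 = least significant) from a packed 32-bit value. */
static uint8_t extract_byte(uint32_t packed, int i) {
    return (uint8_t)((packed >> (8 * i)) & 0xFF);
}

int main(void) {
    uint32_t packed = 0xC0A800FF;       /* the second compress() example above */
    for (int i = 3; i >= 0; i--)
        printf("%u ", extract_byte(packed, i));    /* prints 192 168 0 255 */
    printf("\n");
    return 0;
}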
I think that clarification on 2 points is required here :
1. Memory addressing.
2. Word
Memories can be addressed in 2 ways, they are generally either byte addressable or word addressable.
Byte addressable memory means that each byte is given a separate address.
a -> 0th byte
b -> 1st byte
Word-addressable memories are those in which each group of bytes as wide as the word gets an address. E.g. if the word length is 32 bits:
a->0th byte
b->4th byte
And so on.
Word
I would say that a word defines the maximum number of bits a processor can handle at a time. For the 8086, for example, it's 16.
It is usually the largest number on which arithmetic can be performed by the processor. Continuing the example, the 8086 can perform operations on 16-bit numbers at a time.
Now I'll try to answer the questions:
1.) Would this mean we can send 4 numbers (of 8 bits / 1 byte each) in 32 bits? Or a combination of 8-bit (byte), 32-bit (4-byte),
etc. numbers at one time?
You can always define your own interpretation for a bunch of bits.
For example, if the memory is byte addressable, we can treat every byte individually, and thus we can write assembly-level code that treats each byte as a separate 8-bit number.
If it is not, you can use bit operations to extract the individual bytes.
The point is that you can represent four 8-bit numbers in 32 bits.
2) Mostly, the leftover (more significant) bits are stuffed with 0s (for unsigned numbers).
3.) Now, to represent an 8-digit number, we would need 27 bits (remember the famous question: sorting 1 million 8-digit numbers with just 1 MB of RAM).
Can we use exactly 27 bits, which is 32 bits (4 bytes) minus 5 bits, and
use those 5 bits for something else?
Yes, you can do this too, but remember the classic space-time tradeoff.
You do save 5 bits per number, but you'll need bit operations and all the really cool but hard-to-read code that comes with them, increasing runtime and making the code more complex.
I don't think you'll ever come across a situation where you need that level of saving, though, unless you are coding for a very constrained system (embedded, etc.).
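For the curious, here is a rough sketch of what such 27-bit packing could look like in C (helper names are mine, no bounds checking); it shows the kind of shift-and-mask code the saving costs you:

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_NUM 27   /* 99,999,999 (the largest 8-digit number) < 2^27 */

/* Write one 27-bit value at bit offset pos in a packed byte buffer, MSB first. */
static void put27(uint8_t *buf, size_t pos, uint32_t value) {
    for (int i = 0; i < BITS_PER_NUM; i++, pos++)
        if ((value >> (BITS_PER_NUM - 1 - i)) & 1u)
            buf[pos / 8] |= (uint8_t)(1u << (7 - pos % 8));
}

/* Read one 27-bit value back from bit offset pos. */
static uint32_t get27(const uint8_t *buf, size_t pos) {
    uint32_t value = 0;
    for (int i = 0; i < BITS_PER_NUM; i++, pos++)
        value = (value << 1) | ((buf[pos / 8] >> (7 - pos % 8)) & 1u);
    return value;
}

int main(void) {
    uint8_t buf[16] = {0};
    put27(buf, 0, 99999999);            /* largest 8-digit number */
    put27(buf, 27, 12345678);           /* stored right after it, no padding */
    printf("%u %u\n", get27(buf, 0), get27(buf, 27));
    return 0;
}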

MD5 is 128 bits but why is it 32 characters?

I read some docs about MD5. They said that it is 128 bits, but why is it 32 characters? I can't make the character count work out.
1 byte is 8 bits
if 1 character is 1 byte
then 128 bits is 128/8 = 16 bytes right?
EDIT:
SHA-1 produces 160 bits, so how many characters are there?
32 chars as hexadecimal representation, that's 2 chars per byte.
I wanted to summarize some of the answers into one post.
First, don't think of the MD5 hash as a character string but as a hex number. Therefore, each digit is a hex digit (0-15 or 0-F) and represents four bits, not eight.
Taking that further, one byte or eight bits are represented by two hex digits, e.g. b'1111 1111' = 0xFF = 255.
MD5 hashes are 128 bits in length and generally represented by 32 hex digits.
SHA-1 hashes are 160 bits in length and generally represented by 40 hex digits.
For the SHA-2 family, I think the hash length can be one of a pre-determined set. So SHA-512 can be represented by 128 hex digits.
Again, this post is just based on previous answers.
A hex "character" (nibble) is different from a "character"
To be clear on the bits vs byte, vs characters.
1 byte is 8 bits (for our purposes)
8 bits provides 2**8 possible combinations: 256 combinations
When you look at a hex character,
16 combinations of [0-9] + [a-f]: the full range of 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
16 is less than 256, so one hex character does not store a byte.
16 is 2**4: that means one hex character can store 4 bits of a byte (half a byte).
Therefore, two hex characters can store 8 bits: 2**8 combinations.
A byte represented as hex characters is [0-9a-f][0-9a-f], and that represents both halves of a byte (we call a half-byte a nibble).
When you look at a regular single-byte character (we're totally going to skip multi-byte and wide characters here),
it can store far more than 16 combinations.
The capabilities of the character are determined by the encoding. For instance, ISO 8859-1, which uses an entire byte per character, stores letters, digits, punctuation, accented characters and control codes.
All of that takes up the entire 2**8 range.
If a hex character in an md5() could store all that, you'd see all the lowercase letters, all the uppercase letters, all the punctuation and things like ¡°ÀÐàð, whitespace (newlines and tabs), and control characters (which you can't even see and many of which aren't in use).
So they're clearly different, and I hope that provides the best breakdown of the differences.
MD5 yields hexadecimal digits (0-15 / 0-F), so they are four bits each. 128 / 4 = 32 characters.
SHA-1 yields hexadecimal digits too (0-15 / 0-F), so 160 / 4 = 40 characters.
(Since they're mathematical operations, most hashing functions' output is commonly represented as hex digits.)
You were probably thinking of ASCII text characters, which are 8 bits.
One hex digit = 1 nibble (four-bits)
Two hex digits = 1 byte (eight-bits)
MD5 = 32 hex digits
32 hex digits = 16 bytes ( 32 / 2)
16 bytes = 128 bits (16 * 8)
The same applies to SHA-1 except it's 40 hex digits long.
I hope this helps.
That's 32 hex characters - 1 hex character is 4 bits.
Those are hexidecimal digits, not characters. One digit = 4 bits.
They're not actually characters, they're hexadecimal digits.
For a clear understanding, paste an MD5-calculated 128-bit hash value into a binary-to-hex converter and look at the length of the hex value. You will get 32 hex characters.
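To put the arithmetic from the answers above in one place, a tiny C sketch that just divides the digest size in bits by 8 (bytes) and by 4 (hex digits):

#include <stdio.h>

int main(void) {
    struct { const char *name; int bits; } hashes[] = {
        {"MD5", 128}, {"SHA-1", 160}, {"SHA-256", 256}, {"SHA-512", 512}
    };
    for (int i = 0; i < 4; i++)
        printf("%-8s %3d bits -> %2d bytes -> %3d hex digits\n",
               hashes[i].name, hashes[i].bits,
               hashes[i].bits / 8, hashes[i].bits / 4);
    return 0;   /* 32, 40, 64 and 128 hex digits respectively */
}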

If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?

On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings UTF-8 is an 8-bits encoding.
So, what's the truth?
If it's 8-bits encoding, then what's the difference between ASCII and UTF-8?
If it's not, then why is it called UTF-8 and why do we need UTF-16 and others if they occupy the same memory?
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky - Wednesday, October 08, 2003
Excerpt from above:
Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
UTF-8 is an 8-bit, variable-width encoding. The first 128 characters in Unicode, when represented with the UTF-8 encoding, have the same representation as the corresponding characters in ASCII.
To understand this further, Unicode treats characters as codepoints - a mere number that can be represented in multiple ways (the encodings). UTF-8 is one such encoding. It is most commonly used, for it gives the best space consumption characteristics among all encodings. If you are storing characters from the ASCII character set in UTF-8 encoding, then the UTF-8 encoded data will take the same amount of space. This allowed for applications that previously used ASCII to seamlessly move (well, not quite, but it certainly didn't result in something like Y2K) to Unicode, for the character representations are the same.
I'll leave this extract here from RFC 3629, on how the UTF-8 encoding would work:
Char. number range   |        UTF-8 octet sequence
   (hexadecimal)     |              (binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You'll notice why the encoding will result in characters occupying anywhere between 1 and 4 bytes (the right-hand column) for different ranges of characters in Unicode (the left-hand column).
UTF-16, UTF-32, UCS-2, etc. employ different encoding schemes where the code points are represented as 16-bit or 32-bit units, instead of the 8-bit units that UTF-8 uses.
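If you want to see that table in action, here is a minimal C sketch of an encoder that follows it (the function name is mine; error handling is reduced to a return code, and a real encoder would also have to reject the surrogate range U+D800-U+DFFF):

#include <stdio.h>
#include <stdint.h>

/* Encode one code point (<= 0x10FFFF) as UTF-8, per the RFC 3629 table.
 * Returns the number of bytes written (1-4), or 0 if out of range. */
static int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void) {
    unsigned char buf[4];
    uint32_t samples[] = { 0x48, 0xE9, 0x20AC, 0x1F680 };   /* H, é, €, rocket */
    for (int i = 0; i < 4; i++) {
        int n = utf8_encode(samples[i], buf);
        printf("U+%04X ->", (unsigned)samples[i]);
        for (int j = 0; j < n; j++)
            printf(" %02X", buf[j]);
        printf("\n");                   /* 1, 2, 3 and 4 bytes respectively */
    }
    return 0;
}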
The '8-bit' encoding means that the individual bytes of the encoding use 8 bits. In contrast, pure ASCII is a 7-bit encoding as it only has code points 0-127. It used to be that software had problems with 8-bit encodings; one of the reasons for Base-64 and uuencode encodings was to get binary data through email systems that did not handle 8-bit encodings. However, it's been a decade or more since that ceased to be allowable as a problem - software has had to be 8-bit clean, or capable of handling 8-bit encodings.
Unicode itself is a 21-bit character set. There are a number of encodings for it:
UTF-32 where each Unicode code point is stored in a 32-bit integer
UTF-16 where many Unicode code points are stored in a single 16-bit integer, but some need two 16-bit integers (so it needs 2 or 4 bytes per Unicode code point).
UTF-8 where Unicode code points can require 1, 2, 3 or 4 bytes to store a single Unicode code point.
So, "UTF-8 can be represented by 1-4 bytes" is probably not the most appropriate way of phrasing it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.
Just complementing the other answers about UTF-8 coding, which uses 1 to 4 bytes.
As people said above, a 4-byte code totals 32 bits, but 11 of those 32 bits are used as prefixes in the control bytes, i.e. to identify the size of a Unicode symbol's code (between 1 and 4 bytes) and to make it easy to resynchronize even in the middle of a text.
The golden question is: why do we need so many bits (11) of control in a 32-bit code? Wouldn't it be useful to have more than 21 bits for the encoding itself?
The point is that the scheme needs to make it easy to find the way back to the first byte of a code.
Thus, the bytes other than the first cannot have all of their bits free to encode a Unicode symbol, because otherwise they could easily be confused with the first byte of a valid UTF-8 code.
So the model is
0UUUUUUU for a 1-byte code. We have 7 Us, so there are 2^7 = 128
possibilities, which are the traditional ASCII codes.
110UUUUU 10UUUUUU for a 2-byte code. Here we have 11 Us, so there
are 2^11 = 2,048 possibilities, of which we discount the 128 values
already covered by the 1-byte legacy ASCII codes, leaving 1,920.
1110UUUU 10UUUUUU 10UUUUUU for a 3-byte code. Here we have 16 Us, so
there are 2^16 = 65,536 - 2,048 = 63,488 possibilities.
11110UUU 10UUUUUU 10UUUUUU 10UUUUUU for a 4-byte code. Here we have 21
Us, so there are 2^21 = 2,097,152 - 65,536 = 2,031,616 possibilities,
where U is a bit (0 or 1) used to encode a Unicode UTF-8 symbol.
So the total is 128 + 1,920 + 63,488 + 2,031,616 = 2,097,152 Unicode symbols.
In the available Unicode tables (for example, in the Unicode Pad app for Android or here), the Unicode code appears in the form U+H, where H is a hex number of 1 to 6 digits. For example, U+1F680 represents a rocket icon: 🚀.
This hex number corresponds to the U bits of the symbol's code read from right to left (21 bits for a 4-byte code, 16 for 3 bytes, 11 for 2 bytes and 7 for 1 byte), grouped into bytes, with the incomplete byte on the left padded with 0s.
Below we will try to explain why 11 control bits are needed. Some of the choices made were merely arbitrary picks between 0 and 1, with no deeper rationale.
A leading 0 indicates a 1-byte code, which makes 0UUUUUUU always equivalent to the 128 ASCII characters (backwards compatibility).
For symbols that use more than 1 byte, the 10 at the start of the 2nd, 3rd and 4th bytes always tells us we are in the middle of a code.
To avoid confusion: if the first byte starts with 11, it is the leading byte of a Unicode character with a 2-, 3- or 4-byte code. A byte starting with 10, on the other hand, is a middle byte; it never starts the encoding of a Unicode symbol. (Obviously the prefix for continuation bytes could not be just 1, because 0... and 1... together would exhaust all possible bytes.)
If there were no rule for non-initial bytes, decoding would be very ambiguous.
With this choice, we know that an initial byte starts with 0 or 11, which can never be confused with a middle byte, which starts with 10. Just by looking at a byte we already know whether it is an ASCII character, the beginning of a 2-, 3- or 4-byte sequence, or a byte from the middle of such a sequence.
It could have been the opposite choice: the prefix 11 could indicate a middle byte and the prefix 10 the start byte of a 2-, 3- or 4-byte code. That choice is just a matter of convention.
Also as a matter of choice, a 0 as the 3rd bit of the 1st byte means a 2-byte UTF-8 code, and a 1 as the 3rd bit means a 3- or 4-byte UTF-8 code (again, it's impossible to adopt the prefix '11' for 2-byte symbols; it too would exhaust all possible bytes: 0..., 10... and 11...).
So a 4th bit is required in the 1st byte to distinguish between 3- and 4-byte UTF-8 codes.
A 4th bit of 0 means a 3-byte code and 1 means a 4-byte code, which then uses one additional 0 bit that would at first seem needless.
One of the reasons, beyond the pretty symmetry (0 is always the last prefix bit in a starting byte), for having that additional 0 as the 5th bit of the first byte of a 4-byte symbol is to make an unknown string easily recognizable as UTF-8, since no byte ever falls in the range 11111000 to 11111111 (F8 to FF, or 248 to 255).
If we hypothetically used 22 bits (taking the last 0 of the 5 prefix bits in the first byte as part of the character code for a 4-byte sequence), there would be 2^22 = 4,194,304 possibilities in total (22 because there would be 4 + 6 + 6 + 6 = 22 bits left for the symbol and 4 + 2 + 2 + 2 = 10 prefix bits).
With the adopted UTF-8 coding system (the 5th bit fixed at 0 for 4-byte codes), there are 2^21 = 2,097,152 possibilities, of which only 1,112,064 are valid Unicode symbols (21 because there are 3 + 6 + 6 + 6 = 21 bits left for the symbol and 5 + 2 + 2 + 2 = 11 prefix bits).
As we have seen, not all of the 2,097,152 possibilities offered by 21 bits are used; far from it (just 1,112,064). So saving one bit doesn't bring tangible benefits.
Another reason is the possibility of using these unused codes for control functions, outside the Unicode world.
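A small C sketch (the helper is mine) that classifies a byte purely by these prefix bits shows how self-describing the scheme is:

#include <stdio.h>

/* Classify a single byte by its prefix, as described above:
 * 0xxxxxxx ASCII, 110xxxxx / 1110xxxx / 11110xxx lead bytes of
 * 2-, 3- and 4-byte codes, 10xxxxxx continuation, F8-FF never valid. */
static const char *classify(unsigned char b) {
    if ((b & 0x80) == 0x00) return "ASCII / 1-byte code";
    if ((b & 0xC0) == 0x80) return "continuation (middle) byte";
    if ((b & 0xE0) == 0xC0) return "lead byte of a 2-byte code";
    if ((b & 0xF0) == 0xE0) return "lead byte of a 3-byte code";
    if ((b & 0xF8) == 0xF0) return "lead byte of a 4-byte code";
    return "never valid in UTF-8";
}

int main(void) {
    unsigned char bytes[] = { 0x48, 0xC3, 0xA9, 0xE2, 0xF0, 0x9F, 0xFE };
    for (int i = 0; i < 7; i++)
        printf("0x%02X: %s\n", bytes[i], classify(bytes[i]));
    return 0;
}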

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 bytes (C0-C1 and F5-FF) that are never used. And multi-byte sequences that are not used such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
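A short brute force over that formula (a sketch, with made-up variable names) reproduces those numbers:

#include <stdio.h>

/* Search for the (B, C) that maximizes N = A + B*T + C*T^2,
 * with A = 128 and T = 256 - (A + B + C). */
int main(void) {
    int A = 128, bestB = 0, bestC = 0;
    long bestN = 0;
    for (int B = 0; A + B <= 256; B++) {
        for (int C = 0; A + B + C <= 256; C++) {
            long T = 256 - (A + B + C);
            long N = A + (long)B * T + (long)C * T * T;
            if (N > bestN) { bestN = N; bestB = B; bestC = C; }
        }
    }
    printf("B=%d C=%d N=%ld\n", bestB, bestC, bestN);   /* B=0 C=43 N=310803 */
    return 0;
}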
Is there a different approach that could encode more characters?
It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.
I did the math, and it's not possible (if wanting to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.
Sure it's possible. Proof:
2^24 = 16,777,216
So there is enough of a bit-space for 1,114,112 characters but the more crowded the bit-space the more bits are used per character. The whole point of UTF-8 is that it makes the assumption that the lower code points are far more likely in a character stream so the entire thing will be quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then solve this as an equation. Choose an encoding scheme where the first byte determines how many bytes are used. So there are 128 values. Each of these will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128-x) = 1114112 - 128
Solving this, you need 111 values of the first byte as 2-byte characters and the remaining 17 as 3-byte. To check:
128 + 111 * 256 + 17 * 65536 = 1,142,656
To put it another way:
128 code points require 1 byte;
28,416 code points require 2 bytes; and
1,114,112 code points require 3 bytes.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because it's simple bitwise AND tests to determine length and gives an address space of 4,210,816 code points.
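A couple of lines of C to check the arithmetic of both splits (a sketch with a made-up helper name):

#include <stdio.h>

/* Code points reachable when the first byte's value fixes the length:
 * n1 one-byte lead values, n2 two-byte leads, n3 three-byte leads. */
static long capacity(long n1, long n2, long n3) {
    return n1 + n2 * 256 + n3 * 65536;
}

int main(void) {
    printf("%ld\n", capacity(128, 111, 17));   /* 1,142,656: covers Unicode's 1,114,112 */
    printf("%ld\n", capacity(128, 64, 64));    /* 4,210,816: the simpler bitwise split */
    return 0;
}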