Can anyone show me the calculation of how UTF-8 represents 1112064 characters? - unicode

I am not understanding how UTF-8 represents 1112064 characters.
My calculation is something like this: 2^7 + 2^11 + 2^16 + 2^21 = 2,164,864 characters.
To represent a character, UTF-8 has 7 payload bits in a 1-byte sequence, 11 bits in a 2-byte sequence, 16 bits in a 3-byte sequence, and 21 bits in a 4-byte sequence.
Is the number 1112064 without Emojis?

1112064 is the number of valid Unicode code points. It consists of 17 regions of 65536 code points, U+NN0000..U+NNFFFF where NN is 0x00 (the BMP, or Basic Multilingual Plane) through 0x10, less the reserved 2048 code points used for surrogates in the UTF-16 encoding, U+D800..U+DFFF.
17 x 65536 - 2048 = 1112064
UTF-8 can represent more than that, but the specification restricts valid UTF-8 to valid Unicode code points encoded in their shortest representation. For example, U+0000 can be encoded as the single byte 0x00 and also as the 2-byte sequence 0xC0 0x80, but the latter (and any longer form) is invalid.
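The arithmetic can be checked directly. A quick Python sketch (the plane count and surrogate range are from the answer above; Python's built-in UTF-8 codec happens to enforce the same restrictions as the spec):

```python
# 17 planes of 65536 code points, minus the 2048 surrogate code points
# U+D800..U+DFFF that are reserved for UTF-16 and never valid on their own.
planes = 17
per_plane = 0x10000              # 65536 code points per plane
surrogates = 0xE000 - 0xD800     # 2048
valid = planes * per_plane - surrogates
print(valid)                     # 1112064

# Python's UTF-8 codec applies the same rules:
assert len("\U0010FFFF".encode("utf-8")) == 4   # highest code point needs 4 bytes
try:
    b"\xC0\x80".decode("utf-8")                  # overlong 2-byte encoding of U+0000
except UnicodeDecodeError:
    print("overlong encoding rejected")
```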

4 ^ 8 + 4 ^ 10 = 1,114,112
4 ^ 8 + 4 ^ 10 - ( 4 ^ 4 ) * 8 = 1,112,064
4 ^ 8 + 4 ^ (5 * 2) - ( 4 ^ 5 ) * 2 = 1,112,064
4 ^ 8 + 4 ^ (5 * 2) - 2 ^ ( 5 + 2 + 4 ) = 1,112,064
4 ^ 8 + 4 ^ (5 * 2) - 2 ^ ( -5 + 2 ^ 4 ) = 1,112,064
(In the formulas above: 4^8 is the B.M.P., 4^10 is the supplementary planes, and the subtracted 2048 is the surrogates.)
fun side notes :
4 ^ 2 ^ 5 = 16 ^ 16 = 2 ^ 64
2 ^ 4 ^ 5 = 2 ^ 1024

SQL Server LEN function reports wrong result

Let's say we cast an int into a binary value, e.g.
cast(120 as binary(8)), or any other int into binary(8).
We would normally expect len(cast(120 as binary(8))) to be 8, and this holds until we try the number 32: select len(cast(32 as binary(8))) returns 7!
Is this a bug in SQL Server?
Not a bug, it's how LEN works. LEN:
Returns the number of characters of the specified string expression,
excluding trailing spaces.
The definition of "trailing space" seems to differ based on the datatype. For binary values, a trailing space is a trailing 0x20 byte in the binary representation. In the BOL entry for LEN there is a note that reads:
Use the LEN to return the number of characters encoded into a given
string expression, and DATALENGTH to return the size in bytes for a
given string expression. These outputs may differ depending on the
data type and type of encoding used in the [value]. For more
information on storage differences between different encoding types,
see Collation and Unicode Support.
With Binary the length (LEN) is reduced by 1 for binary values that end with 20, by 2 for values that end with 2020, etc. Again, it's treating that value like a trailing space. DATALENGTH resolves this. Note this SQL:
DECLARE
    @string VARCHAR(100) = '1234567 ',
    @binary BINARY(8) = 32;
SELECT [Type] = 'string', [Len] = LEN(@string), [Datalength] = DATALENGTH(@string)
UNION ALL
SELECT [Type] = 'binary(8)', [Len] = LEN(@binary), [Datalength] = DATALENGTH(@binary);
Returns:
Type Len Datalength
--------- ----------- -----------
string 7 8
binary(8) 7 8
Using my rangeAB function (here) I created this query:
SELECT
    N = r.RN,
    Binaryvalue = CAST(r.RN AS binary(8)),
    [Len] = LEN(CAST(r.RN AS binary(8))),
    [DataLength] = DATALENGTH(CAST(r.RN AS binary(8)))
FROM dbo.rangeAB(0,10000,1,0) AS r
WHERE LEN(CAST(r.RN AS binary(8))) <> 8
ORDER BY N;
Note these results:
N Binaryvalue Len DataLength
-------------------- ------------------ ----------- -----------
32 0x0000000000000020 7 8
288 0x0000000000000120 7 8
544 0x0000000000000220 7 8
800 0x0000000000000320 7 8
1056 0x0000000000000420 7 8
1312 0x0000000000000520 7 8
1568 0x0000000000000620 7 8
1824 0x0000000000000720 7 8
2080 0x0000000000000820 7 8
2336 0x0000000000000920 7 8
2592 0x0000000000000A20 7 8
2848 0x0000000000000B20 7 8
3104 0x0000000000000C20 7 8
3360 0x0000000000000D20 7 8
3616 0x0000000000000E20 7 8
3872 0x0000000000000F20 7 8
4128 0x0000000000001020 7 8
4384 0x0000000000001120 7 8
4640 0x0000000000001220 7 8
4896 0x0000000000001320 7 8
5152 0x0000000000001420 7 8
5408 0x0000000000001520 7 8
5664 0x0000000000001620 7 8
5920 0x0000000000001720 7 8
6176 0x0000000000001820 7 8
6432 0x0000000000001920 7 8
6688 0x0000000000001A20 7 8
6944 0x0000000000001B20 7 8
7200 0x0000000000001C20 7 8
7456 0x0000000000001D20 7 8
7712 0x0000000000001E20 7 8
7968 0x0000000000001F20 7 8
8224 0x0000000000002020 6 8
8480 0x0000000000002120 7 8
8736 0x0000000000002220 7 8
8992 0x0000000000002320 7 8
9248 0x0000000000002420 7 8
9504 0x0000000000002520 7 8
9760 0x0000000000002620 7 8
Note how the LEN of CAST(8224 AS binary(8)) is 6, because 8224's binary value ends with 2020, which is treated like two trailing spaces:
8224 0x0000000000002020 6 8
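The behavior is easy to model outside SQL Server. A rough Python sketch of what LEN appears to do with binary values, based on the results above (the big-endian cast and the strip-trailing-0x20 rule are inferences from those results, not documented internals):

```python
def sql_len_binary8(n: int) -> int:
    """Model LEN() on CAST(n AS binary(8)): DATALENGTH minus trailing 0x20 bytes."""
    raw = n.to_bytes(8, "big")        # binary(8) stores the int big-endian
    return len(raw.rstrip(b"\x20"))   # LEN strips trailing 0x20 as if it were spaces

print(sql_len_binary8(120))   # 8: 0x0000000000000078, no trailing 0x20
print(sql_len_binary8(32))    # 7: 0x0000000000000020 ends in one 0x20
print(sql_len_binary8(8224))  # 6: 0x0000000000002020 ends in two 0x20 bytes
```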

Hashtable open addressing handling probing cycles

I have been investigating various collision-resolution techniques for hash tables implemented via open addressing. However, all the methods I have looked at so far (linear probing, quadratic probing, double hashing) share a pitfall: there can exist a probing sequence that forms a cycle shorter than the size of the table. This is problematic for insertion, because free buckets may exist yet be unreachable if they are not part of the cycle.
For example, if we're using linear probing on a table of size 12 with the function H(k, i) = (h(k) + 4*i) mod 12, then a cycle occurs if a particular key hashes to 8 and slots 0, 4, and 8 are already filled:
H(k, 0) = (8 + 0) mod 12 = 8
H(k, 1) = (8 + 4) mod 12 = 0
H(k, 2) = (8 + 8) mod 12 = 4
H(k, 3) = (8 + 12) mod 12 = 8
H(k, 4) = (8 + 16) mod 12 = 0
H(k, 5) = (8 + 20) mod 12 = 4
H(k, 6) = (8 + 24) mod 12 = 8
...
Similar cycles can also be found with quadratic and double hashing if the probing sequence is bad. So my question is: how are such cycles handled? Or do we always pick hash functions and table sizes that do not permit cycles shorter than the table?
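The cycle from the example can be reproduced directly. A small Python sketch (the step of 4 and table size 12 are the question's numbers; the standard remedy is to make the probe step and table size coprime, e.g. a step of 1 or a prime table size):

```python
import math

def probe_sequence(start, step, size, limit=None):
    """Slots visited by H(k, i) = (start + step*i) mod size."""
    limit = size if limit is None else limit
    return [(start + step * i) % size for i in range(limit)]

# Step 4 with size 12: gcd(4, 12) = 4, so only 12 / 4 = 3 distinct slots
# (a cycle of length 3) are ever reachable from any starting slot.
print(sorted(set(probe_sequence(8, 4, 12))))   # [0, 4, 8]

# Step 1 (plain linear probing) reaches every slot because gcd(1, 12) = 1.
print(len(set(probe_sequence(8, 1, 12))))      # 12

# In general, the cycle length is size / gcd(step, size).
assert len(set(probe_sequence(8, 4, 12))) == 12 // math.gcd(4, 12)
```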

Detect number of bytes required for an arbitrary number in Scala

I'm trying to figure out the simplest way to write a function to detect the amount of bytes required for a number in Scala.
For instance the number
0 should be 0 bytes
1 should be 1 byte
127 should be 1 byte
128 should be 2 bytes
32767 should be 2 bytes
32768 should be 3 bytes
8388607 should be 3 bytes
8388608 should be 4 bytes
2147483647 should be 4 bytes
2147483648 should be 5 bytes
549755813887 should be 5 bytes
549755813888 should be 6 bytes
9223372036854775807 should be 8 bytes.
-1 should be 1 byte
-127 should be 1 byte
-128 should be 2 bytes
-32767 should be 2 bytes
-32768 should be 3 bytes
-8388607 should be 3 bytes
-8388608 should be 4 bytes
-2147483647 should be 4 bytes
-2147483648 should be 5 bytes
-549755813887 should be 5 bytes
-549755813888 should be 6 bytes
-9223372036854775807 should be 8 bytes
Is there any way to do this besides doing the math to figure out where the number falls relative to 2^N?
After all the clarifications in the comments, I guess the algorithm for negative numbers would be: whatever the answer for their opposite would be; and Long.MinValue is not an acceptable input value.
Therefore, I suggest:
def bytes(x: Long): Int = {
  val posx = x.abs
  if (posx == 0L) 0
  else (64 - java.lang.Long.numberOfLeadingZeros(posx)) / 8 + 1
}
Tests needed.
As I mentioned, you're basically asking for "what's the smallest power-of-2-number larger than my number", with a bit of adjustment for the extra digit for the sign (positive or negative).
Here's my solution, although the result differs for 0 and -128, because, as Bergi commented on your question, you can't really write 0 with 0 bytes, and -128 fits in 1 byte.
import Math._

def bytes(x: Double): Int = {
  val y = if (x >= 0) x + 1 else -x
  ceil((log(y)/log(2) + 1)/8).toInt
}
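For comparison, the same counting-the-bits idea can be sketched in Python, where int.bit_length() plays the role of 64 - numberOfLeadingZeros (this follows the first answer's convention: 0 needs 0 bytes and negative numbers use their absolute value):

```python
def bytes_needed(x: int) -> int:
    """Bytes needed under the question's convention: the magnitude's bit
    length divided by 8, plus one for the sign bit. (Named bytes_needed
    rather than bytes to avoid shadowing Python's builtin.)"""
    posx = abs(x)
    if posx == 0:
        return 0
    # bit_length() is Python's counterpart to 64 - numberOfLeadingZeros.
    return posx.bit_length() // 8 + 1

print(bytes_needed(127))                   # 1
print(bytes_needed(128))                   # 2
print(bytes_needed(-9223372036854775807))  # 8
```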

What number format or notation is 01h?

I was reviewing this NEC document for programming a projector, which shows RS232 command examples using number formats such as: 20h + 81h + 01h + 60h + 01h + 00h = 103h. From other sections in the document it would seem that h = 15, though I could be wrong.
I'm a bit embarrassed to ask, but what number format is this (20h or 103h)?
It's hexadecimal
20h == 0x20 == 32
I hadn't seen this kind of notation in a while. I remember it being used for old PC BIOS/DOS interrupt tables: http://spike.scu.edu.au/~barry/interrupts.html
It's hex.
20h = 32
01h = 1
Similar to the 0x notation. E.g. 0x20 = 20h = 32.
In Section 2.1 of the document you linked:
Command/response A series of strings enclosed in a frame represents a
command or response (in hexadecimal notation).
It's hexadecimal.
That means the number base is 16 instead of 10.
So the digits are 0 1 2 3 4 5 6 7 8 9 A B C D E F.
Adding 20h + 81h equals A1h
or 32 + 129 = 161.
Another notation is the 0x prefix used in C-family languages, e.g. 0x00.
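The sums from the document check out if every NNh is read as base 16; a quick Python verification:

```python
# The trailing-h notation ("NNh") is hexadecimal, same as the 0x prefix.
values = [0x20, 0x81, 0x01, 0x60, 0x01, 0x00]   # 20h + 81h + 01h + 60h + 01h + 00h
total = sum(values)
print(hex(total))            # 0x103 -- the document's 103h
print(0x20 + 0x81 == 0xA1)   # True: 20h + 81h = A1h, i.e. 32 + 129 = 161
assert int("20", 16) == 32   # parsing "20h" after dropping the suffix
```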

What's the maximum length of scrypt output?

I'd like to store an scrypt-hashed password in a database. What is the maximum length I can expect?
According to https://github.com/wg/scrypt the output format is $s0$params$salt$key where:
s0 denotes version 0 of the format, with 128-bit salt and 256-bit derived key.
params is a 32-bit hex integer containing log2(N) (16 bits), r (8 bits), and p (8 bits).
salt is the base64-encoded salt.
key is the base64-encoded derived key.
According to https://stackoverflow.com/a/13378842/14731 the length of a base64-encoded string is 4 * ceil(n / 3), where n denotes the number of bytes being encoded.
Let's break this down:
The dollar signs make up 4 characters.
The version number makes up 2 characters.
Each hex character represents 4 bits (log2(16) = 4), so the 32-bit params field makes up 32 / 4 = 8 characters.
The 128-bit salt is equivalent to 16 bytes. The base64-encoded format makes up (4 * ceil(16 / 3)) = 24 characters.
The 256-bit derived key is equivalent to 32 bytes. The base64-encoded format makes up (4 * ceil(32 / 3)) = 44 characters.
Putting that all together, we get: 4 + 2 + 8 + 24 + 44 = 82 characters.
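The breakdown above can be put into a short Python sketch (the field sizes are the ones listed above; math.ceil reproduces the base64-length formula):

```python
import math

def b64_len(n_bytes: int) -> int:
    """Length of standard (padded) base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)

dollars = 4          # the four '$' signs in $s0$params$salt$key
version = 2          # "s0"
params = 32 // 4     # 32-bit hex integer -> 8 hex characters
salt = b64_len(16)   # 128-bit salt -> 24 characters
key = b64_len(32)    # 256-bit derived key -> 44 characters
print(dollars + version + params + salt + key)  # 82
```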
In Colin Percival's own implementation, the tarsnap scrypt header is 96 bytes. This comprises:
6 bytes 'scrypt'
10 bytes N, r, p parameters
32 bytes salt
16 bytes SHA256 checksum of bytes 0-47
32 bytes HMAC hash of bytes 0-63 (using scrypt hash as key)
This is also the format used by node-scrypt. There is an explanation of the rationale behind the checksum and the HMAC hash on stackexchange.
As a base64-encoded string, this makes 128 characters.
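The 128-character figure follows from the same base64-length formula; a quick Python check (bytes(96) is just 96 placeholder zero bytes standing in for a real header):

```python
import math
import base64

# 'scrypt' magic + N,r,p params + salt + SHA256 checksum + HMAC hash
header = 6 + 10 + 32 + 16 + 32
print(header)                             # 96 bytes
print(4 * math.ceil(header / 3))          # 128 characters as base64

# Cross-check against an actual encoding of 96 placeholder bytes:
print(len(base64.b64encode(bytes(96))))   # 128
```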