Why are there limits on domain name lengths?

From what I know, domains seem to be keys into a hash of the DNS where the value is the resource records for the domain name. Why are they limited in length? The specifications I found say that a domain name:
+Has a maximum label length of 63 characters.
+Has a maximum of 127 labels.
+Cannot be more than 255 bytes in total.
And there are also all sorts of restrictions on special characters, ordering, etc. Why is that?

label length
The 63-byte limit is because of how labels are stored in the DNS protocol: each label is prefixed by a single length byte, but the two high bits of that byte are reserved for something else (message compression), thus leaving 6 bits for the length itself, 2^6 = 64 possible values, 0..63.
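As a rough Python sketch of that bit layout (the helper name is made up for illustration, not taken from any DNS library):

def classify_label_byte(b):
    # 0b11xxxxxx marks a compression pointer (the two high bits are set).
    if b & 0xC0 == 0xC0:
        return "compression pointer"
    # 0b00xxxxxx is a plain label length, so the value can only be 0..63.
    if b & 0xC0 == 0x00:
        return f"label of length {b}"
    # The other two bit patterns (0b01..., 0b10...) are reserved.
    return "reserved"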
To simplify implementations, the total length of a domain name (i.e.,
label octets and label length octets) is restricted to 255 octets or
less.
I did not find a 127-label limit in the specifications. It arises simply from the fact that the whole domain name is at most 255 bytes and a label always takes at least 2 bytes (a single letter plus the dot, or the length byte plus the letter).
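As a rough sketch (Python, with a made-up function name) of how those limits combine:

def check_domain_name(name):
    # Split into labels, ignoring a single trailing dot on a fully qualified name.
    labels = name.rstrip(".").split(".")
    # Each label is limited to 63 bytes by the 6-bit length field.
    if any(len(label) == 0 or len(label) > 63 for label in labels):
        return False
    # Wire format: one length byte per label, the label bytes, plus a final zero byte.
    wire_length = sum(len(label) + 1 for label in labels) + 1
    # The whole encoded name is limited to 255 bytes, which also caps the label count at ~127.
    return wire_length <= 255

print(check_domain_name("example.com"))      # True
print(check_domain_name("a" * 64 + ".com"))  # False, label too long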

Related

Memory blocks and tags

Suppose that we have a CPU with a cache that consists of 128 blocks. 8 bytes of memory can be stored in each block. How can I find which block each address belongs to? Also, what is each address's tag?
The following is my way of thinking.
Take the 32-bit address 1030, for example. If I do 1030 * 4 = 4120, I have the address in byte format. Then I turn it into an 8-byte format: 4120 / 8 = 515.
Then I do 515 % 128 = 3, which is (8-byte address) % (number of blocks), to find the block that this address maps to (block no. 3).
Then I do 515 / 128 = 4 to find the position that the address has in block no. 3. So tag = 4.
Is my way of thinking correct?
Any comment is welcome!
What we know generically:
A cache decomposes addresses into fields, namely a tag field, an index field, and a block offset field.  For any given cache the field sizes are fixed, and knowing their widths (in bits) allows us to decompose an address the same way the cache does.
An address as a simple number:
+-----------------------------+
|           address           |
+-----------------------------+
We would view addresses as unsigned integers, and the number of bits used for the address is the address space size.  As decomposed into fields by the cache:
+----------+---------+--------+
|   tag    |  index  | offset |
+----------+---------+--------+
Each field uses an integer number of bits for its width.
What we know from your problem statement:
the block size is 8 bytes, therefore
the block offset field width is log2( block size in bytes )
the address space (total number of bits in an address) is 32 bits, therefore
tag width + index width + offset width = 32
Since no information about associativity is given, we should assume the cache is direct mapped.  No information to the contrary is provided, and direct-mapped caches are common early in coursework.  Still, I'd verify that, or else state the direct-mapped assumption explicitly.
there are 128 blocks, therefore, for a direct mapped cache
there are 128 index positions in the cache array.
(for 2-way or 4-way set associative we would divide by 2 or 4, respectively)
Given 128 index positions in the cache array
the index field width is log2( number of index positions )
Knowing the index field width, the block offset field width, and total address width, we can compute the tag field width
tag field width = 32 - index field width - block offset field width
Only when you have such field widths does it make sense to attempt to decode a given address and extract the fields' actual values for that address.
Because there are three fields, the preferred approach to extraction is to simply write out the address in binary and group the bits according to the fields and their widths.
(Division and modulus can be made to work, but with (a) 3 fields and (b) the index field in the middle, the arithmetic is arguably more complex: to get the index we have to divide (to remove the block offset) and then take a modulus (to remove the tag bits). It is equivalent to the other approach, though.)
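As a small Python sketch of the above for this particular cache (the names are invented for illustration, and a direct-mapped cache is assumed):

import math

BLOCK_SIZE   = 8    # bytes per block, from the problem statement
NUM_BLOCKS   = 128  # index positions, assuming direct mapped
ADDRESS_BITS = 32

offset_bits = int(math.log2(BLOCK_SIZE))               # 3
index_bits  = int(math.log2(NUM_BLOCKS))               # 7
tag_bits    = ADDRESS_BITS - index_bits - offset_bits  # 22

def decompose(address):
    # Group the address bits from the bottom up: offset, then index, then tag.
    offset = address & (BLOCK_SIZE - 1)
    index  = (address >> offset_bits) & (NUM_BLOCKS - 1)
    tag    = address >> (offset_bits + index_bits)
    return tag, index, offset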
Comments on your reasoning:
You need to know whether 1030 is in decimal or hex.  It is unusual to write an address in decimal notation, since hex notation converts into binary notation (and hence the various bit fields) so much more easily.  (Some educational computers use decimal notation for addresses, but they generally have a much smaller address space, like 3 decimal digits, and certainly not a 32-bit address space.)
Take the 32-bit address 1030, for example. If I do 1030 * 4 = 4120, I have the address in byte format.
Unless something is really out of the ordinary, the address 1030 is already in byte format — so don't do that.
Then I turn it into an 8-byte format: 4120 / 8 = 515.
The 8-byte block size determines the block offset field used when decoding an address.  You need to decode the address into 3 fields, not necessarily divide it.
Again, the key is to first compute the block offset size, then the index size, then the tag size.  Take a given address, convert it to binary, and group the bits to get the tag, index, and block offset values in binary (then maybe convert those values to hex, or decimal if you must).
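Using the decompose sketch above, and assuming the 1030 from the question really is a decimal byte address:

# 1030 decimal = 0b1_0000000_110
# offset = lowest 3 bits  = 0b110     = 6
# index  = next 7 bits    = 0b0000000 = 0
# tag    = remaining bits = 0b1       = 1
print(decompose(1030))  # (1, 0, 6)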

What is the maximum size of symbol data type in KDB+?

I cannot find the maximum size of the symbol data type in KDB+.
Does anyone know what it is?
If you are talking about the physical length of a symbol: symbols exist as interned strings in kdb, so the maximum string length limit would apply. As strings are just a list of characters in kdb, the maximum size of a string would be the maximum length of a list. In 3.x this is 2^64 - 1; in previous versions of kdb this limit was 2,000,000,000.
However, there is a 2TB maximum serialized size limit that would likely kick in first. You can roughly work out the size of a sym by serializing it:
q)count -8!`
10
q)count -8!`a
11
q)count -8!`abc
13
So each character adds a single byte; this would give a length limit of roughly 10^12 characters.
If you mean the maximum number of symbols that can exist in memory, then the limit is 1.4B.

Maximum input and output length for Argon2

As you may know, the maximum input length for bcrypt is 72 characters and the output length is 60 characters. (I've tested it in PHP. Correct me if I'm wrong.)
I want to know maximum input length and the exact output length for argon2. Thanks.
According to https://en.wikipedia.org/wiki/Argon2#Algorithm, the max input length is 2^32 - 1 bytes, i.e. 4,294,967,295 bytes.
As to the equivalent in character length, it depends on what character encoding you use.
According to this answer:
In ASCII or ISO 8859, each character is represented by one byte
In UTF-32, each character is represented by 4 bytes
In UTF-8, each character uses between 1 and 4 bytes
In ISO 2022, it's much more complicated
Still according to https://en.wikipedia.org/wiki/Argon2#Algorithm, I cannot give you an 'exact' output length, because it depends on the lengths you choose for various parameters such as the salt and the output hash itself.
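For example, here is a small sketch using the Python argon2-cffi package (assuming that library) showing that the output length follows from the parameters you pick rather than being fixed:

from argon2 import PasswordHasher

# hash_len and salt_len are explicit parameters, so the encoded output length
# depends on what you choose here (plus the cost parameters embedded in the string).
ph = PasswordHasher(time_cost=3, memory_cost=65536, parallelism=4,
                    hash_len=32, salt_len=16)

encoded = ph.hash("correct horse battery staple")
print(len(encoded), encoded)  # length changes if you change hash_len or salt_len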

Should I use _0, _1, _2... as Strings or 0, 1, 2... as Numbers for _id values of a constant list of things?

Initially, I had documents in the form of: resources:{wood:123, coal:1, silver:5} and boxes:{wood:999, coal:20}. In this example, my server's code tests (quite efficiently) if there is enough space for the wood (and it is) and enough space for the coal (and it is) and enough space for the silver (there is not, if space is 0 I don't even include it in boxes) then all is well.
I want to shorten the _id values from wood, coal, silver to a numeric representation, which in turn occupies less space; the packets of information are smaller when communicating to and from the client / server, etc.
I am curious about using 0, 1, 2... as Numbers for the _id, or _0, _1, _2... as Strings.
What are the advantages of using Number or String? Are Numbers faster for queries? (ignoring index speed).
I am adding these values manually btw :P
The number of bytes necessary to represent an integer can be found by counting how many times the integer can be divided by 256 before reaching zero. The number of bytes necessary to represent a string is the number of characters in the string.
It takes only one byte to represent the numbers 87 or 202, but it takes two and three bytes respectively to represent them as strings (plus one more each if you use the underscore).
Integers are almost certainly what you want here. However, if you're concerned about over-the-wire size, then you might see gains by shortening your keys. Rather than using wood, coal, and silver, you could use w, c, and s, saving you 11 bytes per record pulled.
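As a rough sketch of the over-the-wire effect (Python, assuming the bson module that ships with PyMongo):

import bson

long_keys  = {"resources": {"wood": 123, "coal": 1, "silver": 5}}
short_keys = {"resources": {"w": 123, "c": 1, "s": 5}}

# bson.encode returns the raw BSON bytes, so len() shows the stored/transferred size;
# every key name is embedded in the document, so shorter keys mean smaller documents.
print(len(bson.encode(long_keys)))
print(len(bson.encode(short_keys)))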

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 bytes (C0-C1 and F5-FF) that are never used, as well as multi-byte sequences that are not used, such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
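A quick brute force over B and C (a Python sketch of the formula above) confirms that optimum:

A = 128  # single-byte characters, fixed by the ASCII requirement
best = (0, 0, 0)
for B in range(0, 129):
    for C in range(0, 129 - B):
        T = 256 - (A + B + C)      # values left over for trailing bytes
        N = A + B * T + C * T * T  # characters representable in at most 3 bytes
        if N > best[0]:
            best = (N, B, C)
print(best)  # (310803, 0, 43)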
Is there a different approach that could encode more characters?
It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.
I did the math, and it's not possible (if you want to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.
Sure it's possible. Proof:
2^24 = 16,777,216
So there is enough bit-space for 1,114,112 characters, but the more crowded the bit-space, the more bits are used per character. The whole point of UTF-8 is that it makes the assumption that the lower code points are far more likely in a character stream, so the entire thing will be quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then solve this as an equation. Choose an encoding scheme where the first byte determines how many bytes are used. So there are 128 such values. Each of these will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128 - x) ≥ 1114112 - 128
Solving this, you need 111 values of the first byte for 2-byte characters and the remaining 17 for 3-byte characters. To check:
128 + 111 * 256 + 17 * 65536 = 1,142,656 ≥ 1,114,112
To put it another way:
128 code points can be encoded in 1 byte;
28,416 code points can be encoded in 2 bytes; and
up to 1,114,112 code points can be encoded in 3 bytes.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because the length can be determined with simple bitwise AND tests, and it gives an address space of 4,210,816 code points.
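A quick Python check of the capacity arithmetic for both splits:

# 111 lead values for 2-byte sequences, 17 for 3-byte sequences
print(128 + 111 * 256 + 17 * 65536)  # 1,142,656 >= 1,114,112 code points

# the simpler 64/64 split proposed above
print(128 + 64 * 256 + 64 * 65536)   # 4,210,816 code points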