How do you measure the size of a JWT token? It is a long string value.
I would like the token to be less than 7 kb.
(https://medium.com/dataseries/public-claims-and-how-to-validate-a-jwt-1d6c81823826)
(https://stackoverflow.com/questions/26033983/what-is-the-maximum-size-of-jwt-token#:~:text=As%20a%20JWT%20is%20included,of%20room%20for%20other%20headers.)
JWT is just 3 base64 strings, concatenated with . characters. So, unless you somehow force it into a wider character set, 1 character = 1 byte.
Total size will be a function of the signing algorithm in use, and the actual payload size. base64 has 3:4 overhead. So, your JWT will always be raw payload size * 1.25, plus signature and header. I usually just think of it as 1.5x overhead, and if you come in smaller that's a bonus.
All that said 7kb is pretty huge for something meant to be passed in an HTTP header. I don't know what the hard limit is, but practically speaking I like to stay under 1kb, and ideally under a few hundred characters.
Related
What is the purpose of padding in base64 encoding. The following is the extract from wikipedia:
"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes) ; these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be not a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."
I wrote a program which could base64 encode any string and decode any base64 encoded string. What problem does padding solves?
Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.
If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.
Edit: An Illustration
Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.
I encodes to SQ (SQ== with padding)
AM encodes to QU0 (QU0= with padding)
TJM encodes to VEpN (VEpN with padding)
So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
Why Bother with Padding?
Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.
That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.
If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.
Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.
On a related note, here's an arbitrary base converter I created for you. Enjoy!
https://convert.zamicol.com/
What are Padding Characters?
Padding characters help satisfy length requirements and carry no other meaning.
Decimal Example of Padding:
Given the arbitrary requirement all strings be 8 characters in length, the number 640 can meet this requirement using preceding 0's as padding characters as they carry no meaning, "00000640".
Binary Encoding
The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.
Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.
Base16, hexadecimal or hex, uses 4 bits for each character. One byte can represent two base16 characters.
Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.
We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced this fraction is 3 bytes over 4 characters.
This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise even measuring with 3 byte bundles, unlike base16 and base256 where every byte can stand on it's own.
So why is padding encouraged even though encoding could work just fine without the padding characters?
If the length of a stream is unknown or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rules out any ambiguity. Even if the length is unknown with padding you'll know where your data stream ends.
As a counter example, some standards like JOSE don't allow padding characters. In this case, if there is something missing, a cryptographic signature won't work or other non base64 characters will be missing (like the "."). Although assumptions about length aren't made, padding isn't needed because if there is something wrong it simply won't work.
And this is exactly what the base64 RFC says,
In some circumstances, the use of padding ("=") in base-encoded data
is not required or used. In the general case, when assumptions about
the size of transported data cannot be made, padding is required to
yield correct decoded data.
[...]
The padding step in base 64 [...] if improperly
implemented, lead to non-significant alterations of the encoded data.
For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders, which is described in the descriptions
on padding below. If this property do not hold, there is no
canonical representation of base-encoded data, and multiple base-
encoded strings can be decoded to the same binary data. If this
property (and others discussed in this document) holds, a canonical
encoding is guaranteed.
Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding there is no longer the explicit acknowledgement of measuring in three byte bundles. Without padding you may not be able to guarantee exact reproduction of original encoding without additional information usually from somewhere else in your stack, like TCP, checksums, or other methods.
Alternatively to bucket conversion schemes like base64 is radix conversion which has no arbitrary bucket sizes and for left-to-right readers is left padded. The "iterative divide by radix" conversion method is typically employed for radix conversions.
Examples
Here is the example form RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)
Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.
BASE64("") = "" (No bytes used. 0 % 3 = 0)
BASE64("f") = "Zg==" (One byte used. 1 % 3 = 1)
BASE64("fo") = "Zm8=" (Two bytes. 2 % 3 = 2)
BASE64("foo") = "Zm9v" (Three bytes. 3 % 3 = 0)
BASE64("foob") = "Zm9vYg==" (Four bytes. 4 % 3 = 1)
BASE64("fooba") = "Zm9vYmE=" (Five bytes. 5 % 3 = 2)
BASE64("foobar") = "Zm9vYmFy" (Six bytes. 6 % 3 = 0)
Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp
There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.
Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.
This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:
A full encoding quantum is always completed at the end of a message.
It does not suggest concatenation (top answer here), nor ease of implementation as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today, however in 1993 unsafe C code would have very likely actually taken advantage of this property.
With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has been corrupted for sure) and thus code can easily process that string in a loop that processes 4 characters at a time (always converting 4 input characters to three or less output bytes). So padding makes sanity checking easy (length % 4 != 0 ==> error as not possible with padding) and it makes processing simpler and more efficient.
I know what people will think: Even without padding, I can process all 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true but you are thinking in terms of C (or higher languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP, that has very limited processing power, no RAM storage and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding that's way simpler than without.
Padding fills the output length to a multiple of four bytes in a defined way.
When hashing a string, like a password, with SHA-256, is there a limit to the length of the string I am hashing? For example, is it only "safe" to hash strings that are smaller than 64 characters?
There is technically a limit, but it's quite large. The padding scheme used for SHA-256 requires that the size of the input (in bits) be expressed as a 64-bit number. Therefore, the maximum size is (264-1)/8 bytes ~= 2'091'752 terabytes.
That renders the limit almost entirely theoretical, not practical.
Most people don't have the storage for nearly that much data anyway, but even if they did, processing it all serially to produce a single hash would take an amount of time most would consider prohibitive.
A quick back-of-the-envelope kind of calculation indicates that even with the fastest enterprise SSDs currently1 listed on Tom's hardware, and striping them 16 wide to improve bandwidth, just reading that quantity of data would still take about 220 years.
1. As of April 2016.
There is no such limit, other than the maximum message size of 264-1 bits. SHA2 is frequently used to generate hashes for executables, which tend to be much larger than a few dozen bytes.
The upper limit is given in the NIST Standard FIPS 180-4. The reason for the upper limit is the padding scheme to countermeasure against the MOV attack that Merkle-Damgard construction's artifact. The message length l is lastly appended to the message during padding.
Then append the 64-bit block that is equal to the number l expressed using a binary representation
Therefore by the NIST standard, the maximum file size can be hashed with SHA-256 is 2^64-1 in bits ( approx 2.305 exabytes - that is close to the lower range of the estimated NSA's data center in UTAH, so you don't need to worry).
NIST enables the hash of the size zero message. Therefore the message length starts from 0 to 2^64-1.
If you need to hash files larger than 2^64-1 then either use SHA-512 which has 2^128-1 limit or use SHA3 which has no limit.
I am having a text data in XML format and it's length is around 816814 bytes. It contains some image data as well as some text data.
We are using ZLIB algorithm for compressing and after compressing, the compressed data length is 487239 bytes.
After compressing we are encoding data using BASE64Encoder. But after encoding the compressed data, size is increasing and length of encoded data is 666748 bytes.
Why, after encoding data size is increasing? Is there any other best encoding techniques?
Regards,
Siddesh
As noted, when you are encoding binary 8-bit bytes with 256 possible values into a smaller set of characters, in this case 64 values, you will necessarily increase the size. For a set of n allowed characters, the expansion factor for random binary input will be log(256)/log(n), at a minimum.
If you would like to reduce this impact, then use more characters. Chances are that whatever medium you are using, it can handle more than 64 characters transparently. Find out how many by simply sending all 256 possible bytes, and see which ones make it through. Test the candidate set thoroughly, and then ideally find documentation of the medium that backs up that set of n < 256.
Once you have the set, then you can use a simple hard-wired arithmetic code to convert from the set of 256 to the set of n and back.
That is perfectly normal.
Base64 is required to be done, if your transmitting medium is not designed to transmit binary data but only textual data (eg XML)
So your zip file gets base64 encoded.
Plainly speaking, it requires the transcoder to change "non-ASCII" letters into a ASCII form but still remember the way to go back
As a rule of thumb, it's around a 33% size increase ( http://en.wikipedia.org/wiki/Base64#Examples )
This is the downside of base64. You are better of using a protocol supporting file-transfer... but for files encoded within XML, you are pretty much out of options.
I have a client application (iPhone) that collects 10,000 doubles. I'd like to send this double array over HTTP to an appengine server (java). I'm looking for the best way to do this.
Best can be defined as some combination of ease of programming and compactness of representation as the amount of data can be quite high.
My current idea is that I will convert the entire array of doubles to a string representation and send that as a POST parameter, on the server parse the string and convert back to a double array. Seems inefficient though...
Thanks!
I think you kind of answered your own question :) The big thing to beware of is differences between the floating point representation on the device and the server. These days they're both going to be little-endian, (mostly) IEEE-754 compliant. However, there can still be some subtle differences in implementation that might bite, e.g handling of denormals and infinities, but you can likely get away with ignoring them. I seem to recall a few of the edge cases in NEON (used in the iPhone's Cortex A-8) aren't handled the same as x86.
If you do send as a string, you'll end up with a decimal and binary conversion between, and potentially lose accuracy. This isn't that inefficient, though - it's only 10,000 numbers. Unless you're expecting thousands of devices pumping this data at your server non-stop.
If you'd like some efficiency in the wire transfer and on the device side, then one approach is to just send the doubles in their raw binary form. On the server, reparse them to a doubles (Double.longBitsToDouble). Make sure you get the endian-ness right when you grab the data as longs (it'll be fairly obvious when it's wrong).
I guess that there are lots and lots of different ways to do this. If it were me I would probably just serialize to an array of bytes and then base64 encode it, most other mechanisms will significantly increase the volume of data being passed.
10k doubles is 80k binary bytes is about 107k or so characters base64 encoded. Or 3 doubles is 24 binary bytes is 32 base64 characters. There's tons of base64 conversion example source code available.
This is far preferable to any decimal representation conversions, since the decimal conversion is slower and, worse, potentially lossy.
json
for iphone encode with yajl-obj-c
and for java read with jsonarray
If you have a working method, and you haven't identified a performance problem, then the method you have now is just fine.
Don't go trying to find a better way to do it unless you know it doesn't meet your needs.
On inspection it seems that on the java side, a double (64 bytes) will be about 4 characters (16 bytes * 4). Now, when I think of your average double, let's say 10 digits and a decimal point, plus some delimiter like a space of semicolon, you're looking at about 12 characters per decimal. That's only 3x as much.
So you originally had 80k of data, and now you have 240k of data. Is that really that much of a difference?
What is the purpose of padding in base64 encoding. The following is the extract from wikipedia:
"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes) ; these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be not a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."
I wrote a program which could base64 encode any string and decode any base64 encoded string. What problem does padding solves?
Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.
If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.
Edit: An Illustration
Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.
I encodes to SQ (SQ== with padding)
AM encodes to QU0 (QU0= with padding)
TJM encodes to VEpN (VEpN with padding)
So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
Why Bother with Padding?
Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.
That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.
If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.
Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.
On a related note, here's an arbitrary base converter I created for you. Enjoy!
https://convert.zamicol.com/
What are Padding Characters?
Padding characters help satisfy length requirements and carry no other meaning.
Decimal Example of Padding:
Given the arbitrary requirement all strings be 8 characters in length, the number 640 can meet this requirement using preceding 0's as padding characters as they carry no meaning, "00000640".
Binary Encoding
The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.
Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.
Base16, hexadecimal or hex, uses 4 bits for each character. One byte can represent two base16 characters.
Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.
We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced this fraction is 3 bytes over 4 characters.
This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise even measuring with 3 byte bundles, unlike base16 and base256 where every byte can stand on it's own.
So why is padding encouraged even though encoding could work just fine without the padding characters?
If the length of a stream is unknown or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rules out any ambiguity. Even if the length is unknown with padding you'll know where your data stream ends.
As a counter example, some standards like JOSE don't allow padding characters. In this case, if there is something missing, a cryptographic signature won't work or other non base64 characters will be missing (like the "."). Although assumptions about length aren't made, padding isn't needed because if there is something wrong it simply won't work.
And this is exactly what the base64 RFC says,
In some circumstances, the use of padding ("=") in base-encoded data
is not required or used. In the general case, when assumptions about
the size of transported data cannot be made, padding is required to
yield correct decoded data.
[...]
The padding step in base 64 [...] if improperly
implemented, lead to non-significant alterations of the encoded data.
For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders, which is described in the descriptions
on padding below. If this property do not hold, there is no
canonical representation of base-encoded data, and multiple base-
encoded strings can be decoded to the same binary data. If this
property (and others discussed in this document) holds, a canonical
encoding is guaranteed.
Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding there is no longer the explicit acknowledgement of measuring in three byte bundles. Without padding you may not be able to guarantee exact reproduction of original encoding without additional information usually from somewhere else in your stack, like TCP, checksums, or other methods.
Alternatively to bucket conversion schemes like base64 is radix conversion which has no arbitrary bucket sizes and for left-to-right readers is left padded. The "iterative divide by radix" conversion method is typically employed for radix conversions.
Examples
Here is the example form RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)
Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.
BASE64("") = "" (No bytes used. 0 % 3 = 0)
BASE64("f") = "Zg==" (One byte used. 1 % 3 = 1)
BASE64("fo") = "Zm8=" (Two bytes. 2 % 3 = 2)
BASE64("foo") = "Zm9v" (Three bytes. 3 % 3 = 0)
BASE64("foob") = "Zm9vYg==" (Four bytes. 4 % 3 = 1)
BASE64("fooba") = "Zm9vYmE=" (Five bytes. 5 % 3 = 2)
BASE64("foobar") = "Zm9vYmFy" (Six bytes. 6 % 3 = 0)
Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp
There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.
Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.
This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:
A full encoding quantum is always completed at the end of a message.
It does not suggest concatenation (top answer here), nor ease of implementation as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today, however in 1993 unsafe C code would have very likely actually taken advantage of this property.
With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has been corrupted for sure) and thus code can easily process that string in a loop that processes 4 characters at a time (always converting 4 input characters to three or less output bytes). So padding makes sanity checking easy (length % 4 != 0 ==> error as not possible with padding) and it makes processing simpler and more efficient.
I know what people will think: Even without padding, I can process all 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true but you are thinking in terms of C (or higher languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP, that has very limited processing power, no RAM storage and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding that's way simpler than without.
Padding fills the output length to a multiple of four bytes in a defined way.