What characters are allowed in a JWT token?

I saw that a JWT token consists of A-Z, a-z, 0-9 and the special characters - and _. What is the full list of characters that are allowed in a JWT token?

From the JWT introduction: “The output is three Base64-URL strings separated by dots”.
Base64 has a number of different variants depending on where the encoding will be used. Typical MIME base64 will use +/ as the final two characters, but Base64-URL (RFC 4648 §5) is intended to be used in URLs and filenames, so uses -_ instead.
Therefore a JWT will use the characters a–z, A–Z, 0–9, -, _ and . (the dot that separates the three parts). Or, as a regular expression:
[a-zA-Z0-9-_.]+
If you want to improve on the regex to match the format described:
^[a-zA-Z0-9-_]+\.[a-zA-Z0-9-_]+\.[a-zA-Z0-9-_]+$
Depending on your flavour of regex, \w should match [a-zA-Z0-9_] so you might be able to make this look a bit neater:
^[\w-]+\.[\w-]+\.[\w-]+$
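A quick way to sanity-check that three-part shape, as a Python sketch (the token below is a made-up example, and the check says nothing about whether the token is validly signed):

import re

# Three dot-separated runs of base64url characters; re.ASCII keeps \w
# from matching non-ASCII word characters.
JWT_SHAPE = re.compile(r"^[\w-]+\.[\w-]+\.[\w-]+$", re.ASCII)

token = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0In0.c2lnbmF0dXJl"
print(bool(JWT_SHAPE.match(token)))  # True: shape only, not validity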

Is it appropriate or necessary to use percent-encoding with HTTP Headers?

When I'm building RESTful client and servers, is it appropriate or necessary to use percent-encoding with HTTP Headers (request or response), or does this type of encoding just apply to URIs?
Basically No, but see below.
RFC 2616 describes percent-encoding only for URIs (search it for %, HEX HEX, or percent), and it defines the field-value without mentioning percent-encoding.
However, RFC 2616 allows arbitrary octets (except CTLs) in the header field value, and has a half-baked statement mentioning MIME encoding (RFC 2047) for characters not in ISO-8859-1 (see the definition of TEXT in its Section 2.2). I called that statement "half-baked" because it does not explicitly state that ISO-8859-1 is the mandatory character set to be used for interpreting the octets, but despite that, it normatively requires the use of MIME encoding for characters outside of that character set. It seems that neither the use of ISO-8859-1 nor the MIME encoding of header field values is widely supported.
HTTPbis seems to have given up on this, and goes back to US-ASCII for header field values. See this answer for details.
My reading of this is:
For standard header fields (those defined in RFC2616), percent-encoding is not permitted.
For extension header fields, percent-encoding is not described in RFC2616, but there is room for applying all kinds of encodings, including percent-encoding, as long as the resulting characters are US-ASCII (if you want to be future-proof). Just don't think you have to use percent-encoding.
Some more sources I found:
https://www.quora.com/Do-HTTP-headers-need-to-be-encoded confirms my understanding, although it is not specific about standard headers vs extension headers and does not quote a source.
https://support.ca.com/us/knowledge-base-articles.TEC1904612.html argues that the percent-encoding of extension headers in their product is a measure of protection against cross-site scripting attacks.
TL;DR: Octet percent-encoding and base64 encoding are fine.
Indicating Character Encoding and Language for HTTP Header Field Parameters
https://www.rfc-editor.org/rfc/rfc8187
This document specifies an encoding suitable for use in HTTP header fields...
Read its "3.2.3. Examples" section.
Base64 encoding is fine too, as the HTTP Basic Authorization spec shows: https://www.rfc-editor.org/rfc/rfc7617
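As a rough Python illustration of the RFC 8187 ext-value form (modeled on its "3.2.3. Examples" section; the filename is made up):

from urllib.parse import quote

# ext-value = charset "'" [ language ] "'" value-chars  (RFC 8187)
filename = "€ rates.pdf"  # hypothetical non-ASCII parameter value
ext_value = "UTF-8''" + quote(filename, safe="")
print(f"Content-Disposition: attachment; filename*={ext_value}")
# Content-Disposition: attachment; filename*=UTF-8''%E2%82%AC%20rates.pdf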

Httperf: How to test REST API with encoded URI

I want to test my REST API which has a URI something like this:
/myrestAPI/search?startTime=0&endTime=10&count=8&filters={"params": [{"field":"Topic","value":"Algorithms","type":"MATCH_EXACT"}]}
How would I do that? The httperf reply status is "505 HTTP Version Not Supported".
I know that httperf is not properly encoding this URI before sending it.
How would I achieve that in httperf?
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
For your case, it would be:
/myrestAPI/search?startTime=0&endTime=10&count=8&filters=%7B%22params%22%3A%20%5B%7B%22field%22%3A%22Topic%22%2C%22value%22%3A%22Algorithms%22%2C%22type%22%3A%22MATCH_EXACT%22%7D%5D%7D
Try experimenting with a URL encoder/decoder.
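For reference, the encoded form above can be produced with a few lines of Python, using urllib.parse.quote (safe="" forces reserved characters such as ':' and '{' to be encoded as well):

from urllib.parse import quote

filters = '{"params": [{"field":"Topic","value":"Algorithms","type":"MATCH_EXACT"}]}'
query = f"startTime=0&endTime=10&count=8&filters={quote(filters, safe='')}"
print("/myrestAPI/search?" + query)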

base64 encoding that doesn't use "+/=" (plus, slash or equals) characters?

I need to encode a string of about 1000 characters that can be any byte value (00-FF). I don't want to use hex because it's not dense enough. The problem with base64, as I understand it, is that it includes +, / and =, which are characters I cannot tolerate in my application.
Any suggestions?
Base58Check is an option. It is starting to become something of a de facto standard in cryptocurrency addresses.
Basic improvements over Base64:
Only alphanumeric characters [0-9a-zA-Z]
No look-alike characters: 0 (zero), O (capital o), I (capital i) and l (lower-case L) are all excluded
No punctuation to trigger word wrap or line break in documents and emails
Can also select entire value with a single double click due to no punctuation.
The Bitcoin Address Utility is an implementation example; geared for Bitcoins.
Note: A novel de facto standard may not be adequate for your needs. It is unclear if the Base58Check encoding method will formalise across current protocols.
Pick your replacements. Consider some other variants: base64 Variant table from Wikipedia.
While base64 encoders/decoders are trivial, replacement substitution can be done in a simple pre/post-processing step around existing base64 encode/decode functions (inside wrappers) -- no need to re-invent the wheel (entirely). Or, better yet, as Mr. Skeet points out, find an existing library with enough flexibility.
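A minimal sketch of that pre/post-processing idea in Python; the replacement characters -, _ and ~ are an arbitrary choice for illustration:

import base64

ENC = str.maketrans("+/=", "-_~")  # assumed-acceptable replacement characters
DEC = str.maketrans("-_~", "+/=")

def encode(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii").translate(ENC)

def decode(text: str) -> bytes:
    return base64.b64decode(text.translate(DEC))

assert decode(encode(bytes(range(256)))) == bytes(range(256))

(If only + and / are the problem, base64.urlsafe_b64encode already substitutes -_; the = padding still needs handling.)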
If you have no alternative suitable "funny" characters to choose from (perhaps all the other characters are invalid leaving only the 62 alphanumeric characters to choose from), you can always use an escape character for a very slight (~3/64?) increase in size. For instance, 0 (A) would be encoded as "AA", 62 (+) would be encoded as "AB" and 63 (/) would be encoded as "AC". This too could be done as a pre/post step if you don't want to write your own encoder/decoder from the ground-up. The disadvantage with this approach is that the ratio of output characters to input bytes is not fixed.
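A sketch of that escape scheme in Python, extended (as an assumption) with "AD" for the = padding character:

import base64

ESC = {"A": "AA", "+": "AB", "/": "AC", "=": "AD"}
UNESC = {"A": "A", "B": "+", "C": "/", "D": "="}

def escape(b64: str) -> str:
    return "".join(ESC.get(c, c) for c in b64)

def unescape(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        if text[i] == "A":  # 'A' always introduces a two-character escape
            out.append(UNESC[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

data = bytes(range(256))
assert base64.b64decode(unescape(escape(base64.b64encode(data).decode()))) == data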
If it's just those particular characters that bother you, and you can find some other characters to use instead, then how about implementing your own custom base64 module? It's not all that difficult.
You could use Base32 instead. Less dense than Base64, but eliminates unwanted characters completely.
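For instance, with Python's standard library; note Base32 still pads with =, which can be stripped and restored from the length:

import base64

data = bytes(range(16))
encoded = base64.b32encode(data).decode().rstrip("=")  # only A-Z and 2-7 remain
padded = encoded + "=" * (-len(encoded) % 8)           # restore padding to decode
assert base64.b32decode(padded) == data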
As Ciaran says, base64 isn't terribly hard to implement - but you may want to have a look for existing libraries which allow you to specify a custom set of characters to use. I'm pretty sure there are plenty out there, but you haven't specified which platform you need this for.
Basically, you just need 65 ASCII characters which are acceptable - preferably in addition to line breaks.
Sure. Why not write your own Base64 encoder/decoder, but replace those chars in your algorithm? Sure, it will not be able to be decoded with a normal decoder, but if that's not an issue, then why worry about it? But you'd better have at least 3 other chars that ARE usable in your app to represent the +, / and =...
base62 is essentially base64 but alphanumeric only.
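Since 62 symbols don't divide evenly into bits, base62 is usually implemented over the input as one big integer; a rough Python sketch (leading zero bytes are preserved with a '0' prefix, similar to Base58's '1' convention):

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def b62encode(data: bytes) -> str:
    zeros = len(data) - len(data.lstrip(b"\x00"))
    n = int.from_bytes(data, "big")
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "0" * zeros + "".join(reversed(digits))

def b62decode(text: str) -> bytes:
    zeros = len(text) - len(text.lstrip("0"))
    n = 0
    for ch in text[zeros:]:
        n = n * 62 + ALPHABET.index(ch)
    return b"\x00" * zeros + n.to_bytes((n.bit_length() + 7) // 8, "big")

assert b62decode(b62encode(b"\x00\xffhello")) == b"\x00\xffhello"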

Efficient way to ASCII encode UTF-8

I'm looking for a simple and efficient way to store UTF-8 strings in ASCII-7. By efficient I mean the following:
all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversible without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed
My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.
Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.
UTF-7, or, slightly less transparent but more widespread, quoted-printable.
all ASCII chars in the input should stay ASCII chars in the output
(Obviously not fully possible as you need at least one character to act as an escape.)
Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, is 7-bits long, and encodes the full Unicode range is not possible.
Edited to add:
I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.
If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
Since the hexadecimal representation can encode any arbitrary binary values, it might be possible to shrink the string by compressing them before taking the hex values.
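A sketch of that hex-plus-compression idea; zlib only pays off once the input is long or repetitive enough to beat its overhead:

import binascii
import zlib

original = ("Grüße, Welt! " * 20).encode("utf-8")
hex_form = binascii.hexlify(zlib.compress(original)).decode("ascii")
# Hex digits survive case folding, so the result is case-insensitive.
assert zlib.decompress(binascii.unhexlify(hex_form.upper())) == original
print(len(original) * 2, "->", len(hex_form))  # plain hex vs. compressed hex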
If you're talking about non-standard schemes, URL encoding or numeric character references are two possible options.
It depends on the distribution of characters in your strings.
Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
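A quick way to see that trade-off with Python's standard library (the sample strings are arbitrary):

import base64
import quopri

samples = {"mostly ASCII": "naïve café au lait", "mostly non-ASCII": "こんにちは世界"}
for label, text in samples.items():
    raw = text.encode("utf-8")
    print(label,
          "qp:", len(quopri.encodestring(raw)),
          "utf-7:", len(text.encode("utf-7")),
          "base64:", len(base64.b64encode(raw)))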
Punycode is used for IDNA, but you can use it outside the restrictions imposed by it
Per se, Punycode doesn't fail your last 2 requirements:
>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True
(for idna, python supplies another homonymous encoding)
obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore... but if you supply only lowercase (or if you don't care about the decoded case) you should be good to go

Unicode URL decoding

The usual method of URL-encoding a unicode character is to split it into 2 %HH codes. (\u4161 => %41%61)
But, how is unicode distinguished when decoding? How do you know that %41%61 is \u4161 vs. \x41\x61 ("Aa")?
Are 8-bit characters, that require encoding, preceded by %00?
Or, is the point that unicode characters are supposed to be lost/split?
According to Wikipedia:
Current standard
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.
Non-standard implementations
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.
So, it looks like it's entirely up to the person writing the unencode method... Aren't standards fun?
What I've always done is first UTF-8 encode a Unicode string to make it a series of 8-bit characters before escaping any of those with %HH.
P.S. - I can only hope the non-standard implementations (%uxxxx) are few and far between.
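Tying back to the question's \u4161 example: in Python, urllib.parse does exactly those two steps in one call:

from urllib.parse import quote, unquote

s = "Aa \u4161"
escaped = quote(s, safe="")   # UTF-8-encodes, then %HH-escapes each octet
print(escaped)                # Aa%20%E4%85%A1
assert unquote(escaped) == s  # decodes %HH, then interprets the bytes as UTF-8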
Since URIs were introduced before Unicode was around, or at least in wide use, I imagine this is a very implementation-specific question. UTF-8 encoding your text, then escaping that as normal, sounds like the best idea, since that's completely backwards compatible with any ASCII/ANSI systems in place, though you might get the odd weird character or two.
On the other end, to decode, you'd unescape your text and get a UTF-8 string. If someone using an older system tries to send yours some data in ASCII/ANSI, there's no harm done; that's (almost) UTF-8 encoded already.