Is base64 encoding always one to one - encoding

Is it possible to obtain two identical encoded values for two different inputs to the Base64 encoding algorithm?
Let's use another algorithm for example, a function that replaces underscores with the letter X.
Foo_Bar = FooXBar
FooXBar = FooXBar
Can this sort of thing ever happen with Base64 encoding?

No, it cannot happen. Base64 is a lossless conversion (and it even needs a 33% space more). In math terms, the Base64 function is a bijection.
Note how HTTP basic access authentication use this encoding for the username and the password. Anyone can get the original strings from the encoded one, and for this reason this authentication should be used only under HTTPS.
You can find more details on Base64 also on Wikipedia.

No, base64 is just a way of encoding binary data as printable characters.
Strictly speaking, it's just a number system, like binary (base 2), decimal(base 10), or hexadecimal (base 16). Just as you can convert losslessly between those you can with base 64. In fact, mathematically bases are irrelevant, and are only used for notation and human use, math is equivalent no matter what base you use.

Related

Why Base64 is used "only" to encode binary data?

I saw many resources about the usages of base64 in today's internet. As I understand it, all of those resources seem to spell out single usecase in different ways : Encode binary data in Base64 to avoid getting it misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains following :
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to client, any intermediate servers/systems/routers will simply forward data to next appropriate servers/systems/routers in the path to client. Why would intermediate servers/systems/routers need to interpret something that it receives? Any example of such systems which may corrupt/wrongly interpret data that it receives, in today's internet?
Why do we fear only binary data to be corrupted. We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by this same logic, any text characters that do not belong to base64 characters can be corrupted/misinterpreted. Why then, base64 is use only to encode binary data? Extending the same idea, when we use browser are javascript and HTML files transferred in base64 form?
There's two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time" where some systems took ASCII seriously and only ever considered (and transferred) 7bits out of any 8bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these would have similar effects when transferring binary (i.e. non-text) data over it: they would try to interpret the binary data as textual data in a character encoding that obviously doesn't make sense (since there is no character encoding in binary data) and as a consequence modify the data in an un-fixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
As to your second question, the answer is two-fold:
textual data often comes with some kind of meta-data (either a HTTP header or a meta-tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangles. This might be the use of quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least: Base64 has a serious drawback and that's that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
One of the use of BASE64 is to send email.
Mail servers used a terminal to transmit data. It was common also to have translation, e.g. \c\r into a single \n and the contrary. Note: Also there where no guarantee that 8-bit can be used (email standard is old, and it allowed also non "internet" email, so with ! instead of #). Also systems may not be fully ASCII.
Also \n\n. is considered as end of body, and mboxes uses also \n>From to mark start of new mail, so also when 8-bit flag was common in mail servers, the problems were not totally solved.
BASE64 was a good way to remove all problems: the content is just send as characters that all servers must know, and the problem of encoding/decoding requires just sender and receiver agreement (and right programs), without worrying of the many relay server in between. Note: all \c, \r, \n etc. are just ignored.
Note: you can use BASE64 also to encode strings in URL, without worrying about the interpretation of webbrowsers. You may see BASE64 also in configuration files (e.g. to include icons): special crafted images may not be interpreted as configuration. Just BASE64 is handy to encode binary data into protocols which were not designed for binary data.

Encoding a hash to fit in less space

Whats the smallest hash I can get without making things overly collidable? I figure a good example is hashing "foo".
input = foo
sha1 = 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
sha1 + b64 = C+7Hteo/D9vJXQ3UfzxbwnXaijM
Are there any other standards out there like Base64 that utilize unicode characters? maybe including upper/lower umlaut characters such as Ü and ü to pack more bits into each character? Ideally I'd love to compress the sha1 hash into 4-6 unicode characters I can tack onto a URL.
Reversibly encoding the hash doesn't impact collision rate... Unless your encoding causes some loss of data (then it isn't reversible any more).
Base64 and other binary-to-text encoding schemes are all reversible. Your first output is the hexadecimal (or base16) representation, which is 50% efficient. Base64 achieves 75% efficiency, meaning it cuts the 40-character hex representation to 28 characters.
The most efficient binary encoding scheme is yEnc, which achieves 98% efficiency, meaning a 100 byte long input will be roughly 102 bytes when encoded with yEnc. This is where the real problem arises for you: SHA-1 outputs are 160 bits (20 bytes) long. If you achieve 200% character-byte efficiency by using every 2-byte UTF16 character, you're still looking at 10 characters. You can't achieve this, because 2-byte values from U+D7FF to U+E000 are not valid UTF16 characters. Those byte values are reserved as prefixes for higher-plane characters.
Even if you find such a hyper-efficient1 encoding scheme using unicode, you can't really use those as URLs. Unicode characters are forbidden from URLs and to be standards compliant, you should use % encodings for your URLs. Many browsers will automatically convert them, so you may find this acceptable, but many of the characters you would regularly use would not be human readable and many more would appear to be in different languages.
At this point, if you really need short URLs, you should reconsider using a hash value and instead implement your own identity service (e.g. assign every page or resource an incremental ID, which is admittedly hard to scale) or utilizing another link-shortening service.
1: This not possible from a bit standpoint. Unicode could achieve a higher character-to-bit ratio, but the unicode characters themselves are represented by multiple bytes. The % encodings for UTF8, which most browsers use as the default for unrecognized encodings, get messy quickly.

What is base32 encoding?

There is enough information on how to implement base32 encoding or the specification of base32 encoding but I don't understand what it is, why we need it and where are the primary applications. Can someone please explain and give nice real life scenarios on usage? Thanks.
crockford base32
wikipedia base32
Like any other "ASCII-only" encoding, base32's primary purpose is to make sure that the data it encodes will survive transportation through systems or protocols which have special restrictions on the range of characters they will accept and emerge unmodified.
For example, b32-encoded data can be passed to a system that accepts single-byte character input, or UTF-8 encoded string input, or appended to a URL, or added to HTML content, without being mangled or resulting in an invalid form. Base64 (which is much more common) is used for the exact same reasons.
The main advantage of b32 over b64 is that it is much more human-readable. That's not much of an advantage because the data will typically be processed by computers, hence the relative rarity of b32 versus b64 (which is more efficient space-wise).
Update: there's the same question asked about Base64 here: What is base 64 encoding used for?
Base32 encoding (and Base64) encoding is motivated by situations where you need to encode unrestricted binary within a storage or transport system that allow only data of a certain form such as plain text. Examples include passing data through URLs, XML, or JSON data, all of which are plain text sort of formats that don't otherwise permit or support arbitrary binary data.
In addition to previous answers for base32 vs base64 in numbers. For same .pdf file encoded result is:
base64.base32encode(content) = 190400 symbols
base64.base64encode(content) = 158668 symbols

Efficient way to ASCII encode UTF-8

I'm looking for a simple and efficient way to store UTF-8 strings in ASCII-7. With efficient I mean the following:
all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversable without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed
My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.
Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.
UTF-7, or, slightly less transparent but more widespread, quoted-printable.
all ASCII chars in the input should stay ASCII chars in the output
(Obviously not fully possible as you need at least one character to act as an escape.)
Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, is 7-bits long, and encodes the full Unicode range is not possible.
Edited to add:
I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.
If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
Since the hexadecimal representation can encode any arbitrary binary values, it might be possible to shrink the string by compressing them before taking the hex values.
If you're talking about non-standard schemes - MECE
URL encoding or numeric character references are two possible options.
It depends on the distribution of characters in your strings.
Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
Punycode is used for IDNA, but you can use it outside the restrictions imposed by it
Per se, Punycode doesn't fail your last 2 requirements:
>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True
(for idna, python supplies another homonymous encoding)
obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore... but if you supply only lowercase (or if you don't care about the decoded case) you should be good to go

What is base 64 encoding used for?

I've heard people talking about "base 64 encoding" here and there. What is it used for?
When you have some binary data that you want to ship across a network, you generally don't do it by just streaming the bits and bytes over the wire in a raw format. Why? because some media are made for streaming text. You never know -- some protocols may interpret your binary data as control characters (like a modem), or your binary data could be screwed up because the underlying protocol might think that you've entered a special character combination (like how FTP translates line endings).
So to get around this, people encode the binary data into characters. Base64 is one of these types of encodings.
Why 64?
Because you can generally rely on the same 64 characters being present in many character sets, and you can be reasonably confident that your data's going to end up on the other side of the wire uncorrupted.
It's basically a way of encoding arbitrary binary data in ASCII text. It takes 4 characters per 3 bytes of data, plus potentially a bit of padding at the end.
Essentially each 6 bits of the input is encoded in a 64-character alphabet. The "standard" alphabet uses A-Z, a-z, 0-9 and + and /, with = as a padding character. There are URL-safe variants.
Wikipedia is a reasonably good source of more information.
Years ago, when mailing functionality was introduced, so that was utterly text based, as the time passed, need for attachments like image and media (audio,video etc) came into existence. When these attachments are sent over internet (which is basically in the form of binary data), the probability of binary data getting corrupt is high in its raw form. So, to tackle this problem BASE64 came along.
The problem with binary data is that it contains null characters which in some languages like C,C++ represent end of character string so sending binary data in raw form containing NULL bytes will stop a file from being fully read and lead in a corrupt data.
For Example :
In C and C++, this "null" character shows the end of a string. So "HELLO" is stored like this:
H E L L O
72 69 76 76 79 00
The 00 says "stop here".
Now let’s dive into how BASE64 encoding works.
Point to be noted : Length of the string should be in multiple of 3.
Example 1 :
String to be encoded : “ace”, Length=3
Convert each character to decimal.
a= 97, c= 99, e= 101
Change each decimal to 8-bit binary representation.
97= 01100001, 99= 01100011, 101= 01100101
Combined : 01100001 01100011 01100101
Separate in a group of 6-bit.
011000 010110 001101 100101
Calculate binary to decimal
011000= 24, 010110= 22, 001101= 13, 100101= 37
Covert decimal characters to base64 using base64 chart.
24= Y, 22= W, 13= N, 37= l
“ace” => “YWNl”
Example 2 :
String to be encoded : “abcd” Length=4, it's not multiple of 3. So to make string length multiple of 3 , we must add 2 bit padding to make length= 6. Padding bit is represented by “=” sign.
Point to be noted : One padding bit equals two zeroes 00 so two padding bit equals four zeroes 0000.
So lets start the process :–
Convert each character to decimal.
a= 97, b= 98, c= 99, d= 100
Change each decimal to 8-bit binary representation.
97= 01100001, 98= 01100010, 99= 01100011, 100= 01100100
Separate in a group of 6-bit.
011000, 010110, 001001, 100011, 011001, 00
so the last 6-bit is not complete so we insert two padding bit which equals four zeroes “0000”.
011000, 010110, 001001, 100011, 011001, 000000 ==
Now, it is equal. Two equals sign at the end show that 4 zeroes were added (helps in decoding).
Calculate binary to decimal.
011000= 24, 010110= 22, 001001= 9, 100011= 35, 011001= 25, 000000=0 ==
Covert decimal characters to base64 using base64 chart.
24= Y, 22= W, 9= j, 35= j, 25= Z, 0= A ==
“abcd” => “YWJjZA==”
Base-64 encoding is a way of taking binary data and turning it into text so that it's more easily transmitted in things like e-mail and HTML form data.
http://en.wikipedia.org/wiki/Base64
It's a textual encoding of binary data where the resultant text has nothing but letters, numbers and the symbols "+", "/" and "=". It's a convenient way to store/transmit binary data over media that is specifically used for textual data.
But why Base-64? The two alternatives for converting binary data into text that immediately spring to mind are:
Decimal: store the decimal value of each byte as three numbers: 045 112 101 037 etc. where each byte is represented by 3 bytes. The data bloats three-fold.
Hexadecimal: store the bytes as hex pairs: AC 47 0D 1A etc. where each byte is represented by 2 bytes. The data bloats two-fold.
Base-64 maps 3 bytes (8 x 3 = 24 bits) in 4 characters that span 6-bits (6 x 4 = 24 bits). The result looks something like "TWFuIGlzIGRpc3Rpb...". Therefore the bloating is only a mere 4/3 = 1.3333333 times the original.
Aside from what's already been said, two very common uses that have not been listed are
Hashes:
Hashes are one-way functions that transform a block of bytes into another block of bytes of a fixed size such as 128bit or 256bit (SHA/MD5). Converting the resulting bytes into Base64 makes it much easier to display the hash especially when you are comparing a checksum for integrity. Hashes are so often seen in Base64 that many people mistake Base64 itself as a hash.
Cryptography:
Since an encryption key does not have to be text but raw bytes it is sometimes necessary to store it in a file or database, which Base64 comes in handy for. Same with the resulting encrypted bytes.
Note that although Base64 is often used in cryptography is not a security mechanism. Anyone can convert the Base64 string back to its original bytes, so it should not be used as a means for protecting data, only as a format to display or store raw bytes more easily.
Certificates
x509 certificates in PEM format are base 64 encoded. http://how2ssl.com/articles/working_with_pem_files/
In the early days of computers, when telephone line inter-system communication was not particularly reliable, a quick & dirty method of verifying data integrity was used: "bit parity". In this method, every byte transmitted would have 7-bits of data, and the 8th would be 1 or 0, to force the total number of 1 bits in the byte to be even.
Hence 0x01 would be transmited as 0x81; 0x02 would be 0x82; 0x03 would remain 0x03 etc.
To further this system, when the ASCII character set was defined, only 00-7F were assigned characters. (Still today, all characters set in the range 80-FF are non-standard)
Many routers of the day put the parity check and byte translation into hardware, forcing the computers attached to them to deal strictly with 7-bit data. This force email attachments (and all other data, which is why HTTP & SMTP protocols are text-based), to be convert into a text-only format.
Few of the routers survived into the 90s. I severely doubt any of them are in use today.
From http://en.wikipedia.org/wiki/Base64
The term Base64 refers to a specific MIME content transfer encoding.
It is also used as a generic term for any similar encoding scheme that
encodes binary data by treating it numerically and translating it into
a base 64 representation. The particular choice of base is due to the
history of character set encoding: one can choose a set of 64
characters that is both part of the subset common to most encodings,
and also printable. This combination leaves the data unlikely to be
modified in transit through systems, such as email, which were
traditionally not 8-bit clean.
Base64 can be used in a variety of contexts:
Evolution and Thunderbird use Base64 to obfuscate e-mail passwords[1]
Base64 can be used to transmit and store text that might otherwise cause delimiter collision
Base64 is often used as a quick but insecure shortcut to obscure secrets without incurring the overhead of cryptographic key management
Spammers use Base64 to evade basic anti-spamming tools, which often do not decode Base64 and therefore cannot detect keywords in encoded
messages.
Base64 is used to encode character strings in LDIF files
Base64 is sometimes used to embed binary data in an XML file, using a syntax similar to ...... e.g.
Firefox's bookmarks.html.
Base64 is also used when communicating with government Fiscal Signature printing devices (usually, over serial or parallel ports) to
minimize the delay when transferring receipt characters for signing.
Base64 is used to encode binary files such as images within scripts, to avoid depending on external files.
Can be used to embed raw image data into a CSS property such as background-image.
Some transportation protocols only allow alphanumerical characters to be transmitted. Just imagine a situation where control characters are used to trigger special actions and/or that only supports a limited bit width per character. Base64 transforms any input into an encoding that only uses alphanumeric characters, +, / and the = as a padding character.
Base64 is a binary to a text encoding scheme that represents binary data in an ASCII string format. It is designed to carry data stored in binary format across the network channels.
Base64 mechanism uses 64 characters to encode. These characters consist of:
10 numeric value: i.e., 0,1,2,3,...,9
26 Uppercase alphabets: i.e., A,B,C,D,...,Z
26 Lowercase alphabets: i.e., a,b,c,d,...,z
2 special characters (these characters depends on operating system): i.e. +,/
How base64 works
The steps to encode a string with base64 algorithm are as follow:
Count the number of characters in a String. If it is not multiple of 3, then pad it with special characters (i.e. =) to make it multiple of 3.
Convert string to ASCII binary format 8-bit using the ASCII table.
After converting to binary format, divide binary data into chunks of 6-bits.
Convert chunks of 6-bit binary data to decimal numbers.
Convert decimals to string according to the base64 Index Table. This table can be an example, but as I said, 2 special characters may vary.
Now, we got the encoded version of the input string.
Let's make an example: convert string THS to base64 encoding string.
Count the number of characters: it is already a multiple of 3.
Convert to ASCII binary format 8-bit. We got (T)01010100 (H)01001000 (S)01010011
Divide binary data into chunks of 6-bits. We got 010101 000100 100001 010011
Convert chunks of 6-bit binary data to decimal numbers.We got 21 4 33 19
Convert decimals to string according to the base64 Index Table. We got VEhT
It's used for converting arbitrary binary data to ASCII text.
For example, e-mail attachments are sent this way.
“Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport”(Wiki, 2017)
Example could be the following: you have a web service that accept only ASCII chars. You want to save and then transfer user’s data to some other location (API) but recipient want receive untouched data. Base64 is for that. . . The only downside is that base64 encoding will require around 33% more space than regular strings.
Another Example:: uenc = url encoded = aHR0cDovL2xvYy5tYWdlbnRvLmNvbS9hc2ljcy1tZW4tcy1nZWwta2F5YW5vLXhpaS5odG1s = http://loc.querytip.com/asics-men-s-gel-kayano-xii.html.
As you can see we can’t put char “/” in URL if we want to send last visited URL as parameter because we would break attribute/value rule for “MOD rewrite” – GET parameter.
A full example would be: “http://loc.querytip.com/checkout/cart/add/uenc/http://loc.magento.com/asics-men-s-gel-kayano-xii.html/product/93/”
I use it in a practical sense when we transfer large binary objects (images) via web services. So when I am testing a C# web service using a python script, the binary object can be recreated with a little magic.
[In python]
import base64
imageAsBytes = base64.b64decode( dataFromWS )
The usage of Base64 I'm going to describe here is somewhat a hack. So if you don't like hacks, please do not go on.
I went into trouble when I discovered that MySQL's utf8 does not support 4-byte unicode characters since it uses a 3-byte version of utf8. So what I did to support full 4-byte unicode over MySQL's utf8? Well, base64 encode strings when storing into the database and base64 decode when retrieving.
Since base64 encoding and decoding is very fast, the above worked perfectly.
You have the following points to take note of:
Base64 encoding uses 33% more storage
Strings stored in the database wont be human readable (You could sell that as a feature that database strings use a basic form of encryption).
You could use the above method for any storage engine that does not support unicode.
Mostly, I've seen it used to encode binary data in contexts that can only handle ascii - or a simple - character sets.
The base64 is a binary to a text encoding scheme that represents binary data in an ASCII string format. base64 is designed to carry data stored in binary format across the channels. It takes any form of data and transforms it into a long string of plain text. Earlier we can not transfer a large amount of data like files because it is made up of 2⁸ bit bytes but our actual network uses 2⁷ bit bytes. This is where base64 encoding came into the picture. But, what actually does base64 mean?
let’s understand the meaning of base64.
base64 = base+64
we can call base64 as a radix-64 representation.base64 uses only 6-bits(2⁶ = 64 characters) to ensure the printable data is human readable. but, how? we can also write base65 or base78, but why only 64? let’s prove it.
base64 encoding contains 64 characters to encode any string.
base64 contains:
10 numeric value i.e., 0,1,2,3,…..9.
26 Uppercase alphabets i.e., A,B,C,D,…….Z.
26 Lowercase alphabets i.e., a,b,c,d,……..z.
two special characters i.e., +,/. Depends upon your OS.
The steps followed by the base64 algorithm are as follow:
count the number of characters in a String.
If it is not multiple of 3 pad with special character i.e., = to
make it multiple of 3.
Encode the string in ASCII format.
Now, it will convert the ASCII to binary format 8-bit each.
After converting to binary format, it will divide binary data into
chunks of 6-bits each.
The chunks of 6-bit binary data will now be converted to decimal
number format.
Using the base64 Index Table, the decimals will be again converted
to a string according to the table format.
Finally, we will get the encoded version of our input string.
To expand a bit on what Brad is saying: many transport mechanisms for email and Usenet and other ways of moving data are not "8 bit clean", which means that characters outside the standard ascii character set might be mangled in transit - for instance, 0x0D might be seen as a carriage return, and turned into a carriage return and line feed. Base 64 maps all the binary characters into several standard ascii letters and numbers and punctuation so they won't be mangled this way.
One hexadecimal digit is of one nibble (4 bits). Two nibbles make 8 bits which are also called 1 byte.
MD5 generates a 128-bit output which is represented using a sequence of 32 hexadecimal digits, which in turn are 32*4=128 bits. 128 bits make 16 bytes (since 1 byte is 8 bits).
Each Base64 character encodes 6 bits (except the last non-pad character which can encode 2, 4 or 6 bits; and final pad characters, if any). Therefore, per Base64 encoding, a 128-bit hash requires at least ⌈128/6⌉ = 22 characters, plus pad if any.
Using base64, we can produce the encoded output of our desired length (6, 8, or 10).
If we choose to decide 8 char long output, it occupies only 8 bytes whereas it was occupying 16 bytes for 128-bit hash output.
So, in addition to security, base64 encoding is also used to reduce the space consumed.
Base64 can be used for many purposes.
The primary reason is to convert binary data to something passable.
I sometimes use it to pass JSON data around from one site to another, store information
in cookies about a user.
Note:
You "can" use it for encryption - I don't see why people say you can't, and that it's not encryption, although it would be easily breakable and is frowned upon. Encryption means nothing more than converting one string of data to another string of data that can be either later decrypted or not, and that's what base64 does.