Compression or encrypt data - lossless-compression

I have two bytes and I want to compress them into a single byte using a key( key length can be up to 64 bits).
And further I want to be able to retrieve the two bytes by using the compressed byte and the same key.
Someone has an idea how to do that?
Thanks.

There are 2^{16} = 65,536 ways two choose a pair of 8-bit bytes.
However the result of your procedure is only one 8-bit byte, which can occur in 2^8 = 256 different variations.
So you could use this one byte as input to some decompressing procedure, but because there are only 256 different inputs, the procedure can not produce more than 256 different results, so you can retrieve no more than 256 of the 65,536 possible pairs, the other pairs are not accessible, because you ran out of names for them, so to say.
This makes the procedure impractical, if more than 256 different input byte pairs occur.
(See the comments below for more details)
Compression would only be practical, if restrictions on your input data exist. E.g. if only the pairs p1 = (42,37) and p2 = (127,255) can occur as possible input you could compress them as 01 and and 02.

Related

Which hashing algorithm generates alphanumeric output?

I am looking for an hashing algorithm that generates alphanumeric output. I did few tests with MD5 , SHA3 etc and they produce hexadecimal output.
Example:
Input: HelloWorld
Output[sha3_256]: 92dad9443e4dd6d70a7f11872101ebff87e21798e4fbb26fa4bf590eb440e71b
The 1st character in the above output is 9. Since output is in HEX format, maximum possible values are [0-9][a-f]
I am trying to achieve maximum possible values for the 1st character. [0-9][a-z][A-Z]
Any ideas would be appreciated . Thanks in advance.
Where MD5 computes a 128bit hash and SHA256 a 256bit hash, the output they provide is nothing more than a 128, respectively 256 long binary number. In short, that are a lot of zero's and ones. In order to use a more human-friendly representation of binary-coded values, Software developers and system designers use hexadecimal numbers, which is a representation in base(16). For example, an 8-bit byte can have values ranging from 00000000 to 11111111 in binary form, which can be conveniently represented as 00 to FF in hexadecimal.
You could convert this binary number into a base(32) if you want. This is represented using the characters "A-Z2-7". Or you could use base(64) which needs the characters "A-Za-z0-9+/". In the end, it is just a representation.
There is, however, some practical use to base(16) or hexadecimal. In computer lingo, a byte is 8 bits and a word consists of two bytes (16 bits). All of these can be comfortably represented hexadecimally as 28 = 24×24 = 16×16. Where 28 = 25×23 = 32×8. Hence, in base(32), a byte is not cleanly represented. You already need 5 bytes to have a clean base(32) representation with 8 characters. That is not comfortable to deal with on a daily basis.

Is it possible to encode a series of bytes this way?

Given:
You have a large series of bytes -- call it O.
You a pair of bytes (2 bytes) -- call it E.
Is it possible?
Can you somehow encode the O series with the E pair to produce a new series S that is the same size (length) as O, such that given S, and S alone, you can derive the original series O and pair E?
No.
I assume we are talking about completely random data here.
There is the amount of information that can be stored in length of O bytes. Which is the same amount of bytes in S. Every combination of bytes may be a valid dataset.
It is not possible to store more information with the same amount of bytes.
At least for completely random data (like hashes or encrypted data)
As soon as you know anything about the data it's a different story. Non random data means that the data might take up more space than necessary. Therefore there could be space for more information.

Struggling with Base 64 encoding in T-SQL - and the padding [duplicate]

What is the purpose of padding in base64 encoding. The following is the extract from wikipedia:
"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes) ; these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be not a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."
I wrote a program which could base64 encode any string and decode any base64 encoded string. What problem does padding solves?
Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.
If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.
Edit: An Illustration
Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.
I encodes to SQ (SQ== with padding)
AM encodes to QU0 (QU0= with padding)
TJM encodes to VEpN (VEpN with padding)
So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
Why Bother with Padding?
Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.
That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.
If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.
Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.
On a related note, here's an arbitrary base converter I created for you. Enjoy!
https://convert.zamicol.com/
What are Padding Characters?
Padding characters help satisfy length requirements and carry no other meaning.
Decimal Example of Padding:
Given the arbitrary requirement all strings be 8 characters in length, the number 640 can meet this requirement using preceding 0's as padding characters as they carry no meaning, "00000640".
Binary Encoding
The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.
Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.
Base16, hexadecimal or hex, uses 4 bits for each character. One byte can represent two base16 characters.
Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.
We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced this fraction is 3 bytes over 4 characters.
This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise even measuring with 3 byte bundles, unlike base16 and base256 where every byte can stand on it's own.
So why is padding encouraged even though encoding could work just fine without the padding characters?
If the length of a stream is unknown or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rules out any ambiguity. Even if the length is unknown with padding you'll know where your data stream ends.
As a counter example, some standards like JOSE don't allow padding characters. In this case, if there is something missing, a cryptographic signature won't work or other non base64 characters will be missing (like the "."). Although assumptions about length aren't made, padding isn't needed because if there is something wrong it simply won't work.
And this is exactly what the base64 RFC says,
In some circumstances, the use of padding ("=") in base-encoded data
is not required or used. In the general case, when assumptions about
the size of transported data cannot be made, padding is required to
yield correct decoded data.
[...]
The padding step in base 64 [...] if improperly
implemented, lead to non-significant alterations of the encoded data.
For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders, which is described in the descriptions
on padding below. If this property do not hold, there is no
canonical representation of base-encoded data, and multiple base-
encoded strings can be decoded to the same binary data. If this
property (and others discussed in this document) holds, a canonical
encoding is guaranteed.
Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding there is no longer the explicit acknowledgement of measuring in three byte bundles. Without padding you may not be able to guarantee exact reproduction of original encoding without additional information usually from somewhere else in your stack, like TCP, checksums, or other methods.
Alternatively to bucket conversion schemes like base64 is radix conversion which has no arbitrary bucket sizes and for left-to-right readers is left padded. The "iterative divide by radix" conversion method is typically employed for radix conversions.
Examples
Here is the example form RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)
Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.
BASE64("") = "" (No bytes used. 0 % 3 = 0)
BASE64("f") = "Zg==" (One byte used. 1 % 3 = 1)
BASE64("fo") = "Zm8=" (Two bytes. 2 % 3 = 2)
BASE64("foo") = "Zm9v" (Three bytes. 3 % 3 = 0)
BASE64("foob") = "Zm9vYg==" (Four bytes. 4 % 3 = 1)
BASE64("fooba") = "Zm9vYmE=" (Five bytes. 5 % 3 = 2)
BASE64("foobar") = "Zm9vYmFy" (Six bytes. 6 % 3 = 0)
Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp
There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.
Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.
This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:
A full encoding quantum is always completed at the end of a message.
It does not suggest concatenation (top answer here), nor ease of implementation as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today, however in 1993 unsafe C code would have very likely actually taken advantage of this property.
With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has been corrupted for sure) and thus code can easily process that string in a loop that processes 4 characters at a time (always converting 4 input characters to three or less output bytes). So padding makes sanity checking easy (length % 4 != 0 ==> error as not possible with padding) and it makes processing simpler and more efficient.
I know what people will think: Even without padding, I can process all 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true but you are thinking in terms of C (or higher languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP, that has very limited processing power, no RAM storage and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding that's way simpler than without.
Padding fills the output length to a multiple of four bytes in a defined way.

What should be the data type for the hashed value of a password encrypted using PBKDF2?

I am trying to learn to use PBKDF2 hash functions for storing passwords in the database. I have a rough draft of the procedure that I'll be using to generate the hashed function. But while I am creating the table in PL/SQL Developer which will hold the generated hashed password, what should I declare the data type for the encrypted password variable?
It might be a lame question but I'm trying to learn online. It would be a huge help if I can get links for further study as well. thank you. please help
The first link, as always, is Thomas Pornin's canonical answer to How to securely hash passwords.
Storage in the database
The hash can be stored in BINARY format for the least transformations and smallest number of bytes; see below for sizes.
Alternately, store it in a CHAR after converting to hex, which costs a transformation and double the bytes of the BINARY size
Alternatively, store it in a CHAR after converting to Base64, which costs a transformation and 4/3rds the number of bytes of BINARY size plus padding
i.e. PBKDF2-HMAC-SHA-512 where all 64 bytes of output are used would be
BINARY(64) as binary
CHAR(128) as hex
CHAR(88) as Base64
The number of iterations should be stored in an INT, so it can be trivially increased later
The salt, which must be a per-user, cryptographically random value, can be stored in a BINARY format for the smallest number of bytes, and should be at least 12, and preferably 16-24 bytes long.
i.e. for a 16 byte binary salt
BINARY(16) as binary
CHAR(32) as hex
CHAR(24) as Base64
Optionally a password hash algorithm version as a small INT type
i.e. 1 for PBKDF2-HMAC-SHA-512, and then later if you change to BCrypt, 2 for BCrypt, etc.
Normal PBKDF2 considerations
Consider using PBKDF2-HMAC-SHA-512, as SHA-512 in particular has 64-bit operations that reduce the advantage most GPU based attackers have over you as of early 2016.
Use a high (hundreds of thousands or high tens of thousands) of iterations.
Don't ask for a larger number of PBKDF2 output bytes than the native hash function supports
SHA-512 <= 64 bytes
SHA-384 <= 48 bytes
SHA-256 <= 32 bytes
SHA-224 <= 28 bytes
MD5 <= 20 bytes

pbkdf2 key length

What is the $key_length in PBKDF2
It says that it will be derived from the input, but I see people using key_lengths of 256 and greater, but when I enter 256 as a key_length the output is 512 characters. Is this intentional? Can I safely use 64 as the key_length so the output is 128 characters long?
$key_length is the number of output bytes that you desire from PBKDF2. (Note that if key_length is more than the number of output bytes of the hash algorithm, the process is repeated twice, slowing down that hashing perhaps more than you desire. SHA256 gives 32 bytes of output, for example, so asking for 33 bytes will take roughly twice as long as asking for 32.)
The doubling of the length that you mention is because the code converts the output bytes to hexadecimal (i.e. 2 characters per 1 byte) unless you specify $raw_output = true. The test vectors included specify $raw_output = false, since hexadecimal is simply easier to work with and post online. Depending on how you are storing the data in your application, you can decide if you want to store the results as hex, base64, or just raw binary data.
In the IETF specification of Password-Based Cryptography Specification Version 2.0 the key length is defined as
"intended length in octets of the derived key, a positive integer, at most
(2^32 - 1) * hLen" Here hLen denotes the length in octets of the pseudorandom function output. For further details on pbkdf2 you can refer How to store passwords securely with PBKDF2