How to encode data to remove any 0x00 bytes - encoding

I am streaming some data of fixed length. The data consists of some 64-bit ints as well as some 32-bit floats. Because the format of the data is fixed and known, I am just sending an array of bytes with a known endian-ness. The data can then be easily reconstructed at the other end.
However, my transport protocol will not allow any 0x00 bytes. Is there a way I can encode my data differently to avoid this? Losing some range in the data is fine (e.g. ints having a maximum of 2^60 is totally fine). Increasing the full size of the message is totally fine too, as long as the full length of data is fixed no matter what the values of the ints and floats are (e.g. if ints now take 9 bytes to store).
I don't know much about encoding formats, but I learned about CRCs a long time ago and I'm wondering if there's something like that, which will add some fixed length block to the end of the bytestream, but which will prevent the bytestream from containing any 0x00 bytes?

Let's take the case of 64-bit numbers:
Reduce your value range to 2^56 and use the last (or first) 7 bytes to encode that value.
For the 7 value bytes, replace all the 0x00 bytes with 0xff bytes. Record the positions of the bytes that have been flipped.
Use the remaining byte as a bit mask to encode the positions of the bytes that have been flipped. This will take up 7 bits of that remaining byte. The first (or last) bit of that byte needs to be always set to 1 to prevent the encoding byte from becoming 0x00 itself.
For example:
Take the 7 byte value b2 00 c3 d4 e5 ff 00.
Flip the 0x00 bytes to get b2 ff c3 d4 e5 ff ff. Bytes 2 and 7 have been flipped.
Create the bit mask 0100001 and prefix with a 1 bit to get a binary value of 10100001, or a hex value of 0xa1.
Your encoded 64-bit value will then be a1 b2 ff c3 d4 e5 ff ff.
The approach for 32-bit numbers is the same. Use the last 3 bytes for the value (plus, if you like, the 4 spare bits in the mask byte, giving 28 bits in total), 3 bits to encode which of those bytes have been flipped, and the leftover bit always set to 1.
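Here is a minimal sketch of the 64-bit variant in Python. The function names and the big-endian byte order on the wire are my own assumptions, not part of the question:

def encode_u56(value):
    """Pack a value < 2**56 into 8 bytes that contain no 0x00 byte."""
    if not 0 <= value < 1 << 56:
        raise ValueError("value must fit in 56 bits")
    payload = bytearray(value.to_bytes(7, "big"))
    mask = 0x80                      # high bit always set, so the mask byte is never 0x00
    for i, b in enumerate(payload):
        if b == 0x00:
            payload[i] = 0xFF        # flip the zero byte...
            mask |= 1 << (6 - i)     # ...and record its position in the mask
    return bytes([mask]) + bytes(payload)

def decode_u56(data):
    """Reverse encode_u56."""
    mask, payload = data[0], bytearray(data[1:])
    for i in range(7):
        if mask & (1 << (6 - i)):
            payload[i] = 0x00        # restore the flipped byte
    return int.from_bytes(payload, "big")

# Round trip using the example value b2 00 c3 d4 e5 ff 00 from above:
raw = int.from_bytes(bytes.fromhex("b200c3d4e5ff00"), "big")
encoded = encode_u56(raw)
assert encoded == bytes.fromhex("a1b2ffc3d4e5ffff")
assert 0x00 not in encoded
assert decode_u56(encoded) == raw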

Related

Oddity when encoding large integers using asn.1

I have found numerous references to the encoding requirements of Integers in ASN.1
and that Integers are inherently signed objects
TLV 02 02 0123, for example.
However, I have a 256 bit integer (within a certificate) encoded
30 82 01 09 02 82 01 00 d1 a5 xx xx xx… 02 03 010001
30 start
82 2 byte length
0109 265 bytes
02 Integer
82 2 byte length
0100 256 bytes
d1 a5 xxxx
The d1 is the troubling part, because its leading bit is 1, meaning this 256-bit number would be interpreted as negative, when in fact it is an unsigned number, an RSA public key in fact. Does the signed constraint apply to Integers > 64 bits?
Thanks,
BER/DER uses two's-complement representation for encoding integer values. This means that the first bit (not byte) determines whether a number is positive or negative. This means that sometimes an extra leading zero byte needs to be added to prevent the first bit from causing the integer to be interpreted as a negative number. Note that it is invalid BER/DER to have the first 9 bits all zero.
Yes, you are right. For any non-negative DER/BER-encoded INTEGER - no matter its length - the MSB of the first payload byte is 0.
The program that generated such key is incorrect.
The "signed constraint" (actually, a rule) totally applies to any size integers. However, depending on a domain you might find all sorts of oddities in how domain objects are encoded. This is something that has to be learned and accounted for the hard way, unfortunately.

Isn’t UTF-8’s byte order on big endian machines different than on little endian machines? So why then doesn’t UTF-8 require a BOM?

UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order.
If UTF-8 stored all code points in a single byte, then it would make sense why endianness doesn’t play any role and thus why a BOM isn’t required. But since code points 128 and above are stored using 2, 3 or 4 bytes, their byte order on big endian machines should be different than on little endian machines, so how can we claim UTF-8 always has the same byte order?
Thank you
EDIT:
UTF-8 is byte oriented
I understand that if two byte UTF-8 character C consists of bytes B1 and B2 ( where B1 is first byte and B2 is last byte ), then with UTF-8 those two bytes are always written in the same order ( thus if this character is written to a file on little endian machine LEM, B1 will be first and B2 last. Similarly, if C is written to a file on big endian machine BEM, B1 will still be first and B2 still last).
But what happens when C is written to file F on LEM, but we copy F to BEM and try to read it there? Since BEM automatically swaps bytes ( B1 is now last and B2 first byte ), how will app ( running on BEM ) reading F know whether F was created on BEM and thus order of two bytes wasn’t swapped or whether F was transferred from LEM, in which case BEM automatically swapped the bytes?
I hope the question made some sense.
EDIT 2:
In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time.
a) Oh, so even though character C is 2 bytes long, app ( residing on BEM ) reading F will read into memory just one byte at a time ( thus it will first read B1 into memory and only then B2 )?
b)
In UTF-8, you decide what to do with a byte based on its high-order bits
Assuming file F has two consecutive characters C and C1 ( where C consists of bytes B1 and B2 while C1 has bytes B3, B4 and B5 ), how will the app reading F know which bytes belong together simply by checking each byte's high-order bits ( for example, how will it figure out that B1 and B2 taken together should represent a character and not B1, B2 and B3 )?
If you believe that you're seeing something different, please edit your question and include
I’m not saying that. I simply didn’t understand what was going on
c) Why aren't UTF-16 and UTF-32 also byte oriented?
The byte order is different on big endian vs little endian machines for words/integers larger than a byte.
e.g. on a big-endian machine a short integer of 2 bytes stores the 8 most significant bits in the first byte, the 8 least significant bits in the second byte. On a little-endian machine the 8 most significant bits will be in the second byte, the 8 least significant bits in the first byte.
So, if you write the memory content of such a short int directly to a file/network, the byte ordering within the short int will be different depending on the endianness.
UTF-8 is byte oriented, so there's not an issue regarding endianness. The first byte is always the first byte, the second byte is always the second byte, etc., regardless of endianness.
To answer c): UTF-16 and UTF-32 represent characters as 16-bit or 32-bit words, so they are not byte-oriented.
For UTF-8, the smallest unit is a byte, thus it is byte-oriented. The algorithm reads or writes one byte at a time. A byte is represented the same way on all machines.
For UTF-16, the smallest unit is a 16-bit word, and for UTF-32, the smallest unit is a 32-bit word. The algorithm reads or writes one word at a time (2 bytes, or 4 bytes). The order of the bytes in each word is different on big-endian and little-endian machines.
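A small Python demonstration of the point (the euro sign is just an arbitrary multi-byte example):

s = "€"                                  # U+20AC, three bytes in UTF-8
print(s.encode("utf-8").hex(" "))        # e2 82 ac -> the same bytes on every machine
print(s.encode("utf-16-be").hex(" "))    # 20 ac    -> most significant byte first
print(s.encode("utf-16-le").hex(" "))    # ac 20    -> least significant byte first

# UTF-8 decoding never consults the machine's endianness: the high-order bits of
# each byte say whether it starts a 1/2/3/4-byte sequence (0xxxxxxx, 110xxxxx,
# 1110xxxx, 11110xxx) or is a continuation byte (10xxxxxx), which also answers b).
lead = s.encode("utf-8")[0]              # 0xe2 = 11100010 -> starts a 3-byte sequence
assert lead & 0xF0 == 0xE0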

Why is a SHA-1 Hash 40 characters long if it is only 160 bit?

The title of the question says it all. I have been researching SHA-1 and most places I see it is 40 hex characters long, which to me is 640 bits. Could it not be represented just as well with only 10 hex characters (160 bits = 20 bytes)? And one hex character can represent 2 bytes, right? Why is it twice as long as it needs to be? What am I missing in my understanding?
And couldn't a SHA-1 hash be even just 5 or fewer characters if using Base32 or Base36?
One hex character can only represent 16 different values, i.e. 4 bits (16 = 2^4).
40 × 4 = 160.
And no, you need much more than 5 characters in base-36.
There are 2^160 different SHA-1 hashes in total.
2^160 = 16^40, so this is another reason why we need 40 hex digits.
But 2^160 = 36^(160 × log36(2)) = 36^30.9482..., so you still need 31 characters using base-36.
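As a quick sanity check, the same arithmetic in Python (the digit counts are the theoretical minimum, ignoring any padding a real encoding might add):

import math

bits = 160
for base in (16, 32, 36, 64):
    print(base, math.ceil(bits / math.log2(base)))   # 16 -> 40, 32 -> 32, 36 -> 31, 64 -> 27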
I think the OP's confusion comes from the fact that a string representing a SHA-1 hash takes up 40 bytes (at least if you are using ASCII), which equals 320 bits (not 640 bits).
The reason is that the hash is in binary and the hex string is just an encoding of that. So if you were to use a more efficient encoding (or no encoding at all), you could take only 160 bits of space (20 bytes), but the problem with that is it won't be binary safe.
You could use base64 though, in which case you'd need about 27-28 bytes (or characters) instead of 40 (see this page).
There are two hex characters per 8-bit-byte, not two bytes per hex character.
If you are working with 8-bit bytes (as in the SHA-1 definition), then a hex character encodes a single high or low 4-bit nibble within a byte. So it takes two such characters for a full byte.
My answer only differs from the previous ones in my theory as to the EXACT origin of the OP's confusion, and in the baby steps I provide for elucidation.
A character takes up different numbers of bytes depending on the encoding used (see here). There are a few contexts these days when we use 2 bytes per character, for example when programming in Java (here's why). Thus 40 Java characters would equal 80 bytes = 640 bits, the OP's calculation, and 10 Java characters would indeed encapsulate the right amount of information for a SHA-1 hash.
Unlike the thousands of possible Java characters, however, there are only 16 different hex characters, namely 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F. But these are not the same as Java characters, and take up far less space than the encodings of the Java characters 0 to 9 and A to F. They are symbols signifying all the possible values represented by just 4 bits:
0 0000 4 0100 8 1000 C 1100
1 0001 5 0101 9 1001 D 1101
2 0010 6 0110 A 1010 E 1110
3 0011 7 0111 B 1011 F 1111
Thus each hex character is only half a byte, and 40 hex characters gives us 20 bytes = 160 bits - the length of a SHA-1 hash.
2 hex characters make up a range from 0-255, i.e. 0x00 == 0 and 0xFF == 255. So 2 hex characters are 8 bits, which makes 160 bits for your 40-character SHA digest.
SHA-1 is 160 bits
That translates to 20 bytes = 40 hex characters (2 hex characters per byte)
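A quick check with Python's hashlib (the input string is arbitrary):

import base64, hashlib

h = hashlib.sha1(b"hello")
print(len(h.digest()))                     # 20 raw bytes = 160 bits
print(len(h.hexdigest()))                  # 40 hex characters, 2 per byte
print(len(base64.b64encode(h.digest())))   # 28 characters in base64 (including '=' padding)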

PNG file format endianness?

I'm not sure if endian is the right word, but...
I have been parsing through a PNG file and I have noticed that all of the integer values are in big endian. Is this true?
For example, the width and height are stored in the PNG file as 32-bit unsigned integers. My image is 16x16 and in the file it's stored as:
00 00 00 10
when it should be:
10 00 00 00
Is this true or is there something I am missing?
Yes, according to the specification, integers must be in network byte order (big endian):
All integers that require more than one byte shall be in network byte order: the most significant byte comes first, then the less significant bytes in descending order of significance (MSB LSB for two-byte integers, MSB B2 B1 LSB for four-byte integers). The highest bit (value 128) of a byte is numbered bit 7; the lowest bit (value 1) is numbered bit 0. Values are unsigned unless otherwise noted. Values explicitly noted as signed are represented in two's complement notation.
http://www.w3.org/TR/2003/REC-PNG-20031110/#7Integers-and-byte-order
Integers in PNG are in network byte order (big endian).
See: the spec.
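For example, a minimal Python sketch that pulls the width and height out of the IHDR chunk with big-endian unpacking ("example.png" is a placeholder path, and no error handling is done):

import struct

with open("example.png", "rb") as f:
    assert f.read(8) == b"\x89PNG\r\n\x1a\n"          # PNG signature
    length, chunk_type = struct.unpack(">I4s", f.read(8))
    assert chunk_type == b"IHDR"                      # IHDR is always the first chunk
    width, height = struct.unpack(">II", f.read(8))   # ">" = big endian / network byte order
    print(width, height)                              # 16 16 for the image above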

Difference between Big Endian and little Endian Byte order

What is the difference between Big Endian and Little Endian Byte order ?
Both of these seem to be related to Unicode and UTF-16. Where exactly do we use this?
Big-Endian (BE) / Little-Endian (LE) are two ways to organize multi-byte words. For example, when using two bytes to represent a character in UTF-16, there are two ways to represent the character 0x1234 as a string of bytes (0x00-0xFF):
Byte Index: 0 1
---------------------
Big-Endian: 12 34
Little-Endian: 34 12
In order to decide if a text uses UTF-16BE or UTF-16LE, the specification recommends to prepend a Byte Order Mark (BOM) to the string, representing the character U+FEFF. So, if the first two bytes of a UTF-16 encoded text file are FE, FF, the encoding is UTF-16BE. For FF, FE, it is UTF-16LE.
A visual example: The word "Example" in different encodings (UTF-16 with BOM):
Byte Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
------------------------------------------------------------
ASCII: 45 78 61 6d 70 6c 65
UTF-16BE: FE FF 00 45 00 78 00 61 00 6d 00 70 00 6c 00 65
UTF-16LE: FF FE 45 00 78 00 61 00 6d 00 70 00 6c 00 65 00
For further information, please read the Wikipedia page of Endianness and/or UTF-16.
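A small Python check of that rule ("utf-16" without a suffix picks the platform's byte order and writes the BOM for you):

data = "Example".encode("utf-16")               # BOM + code units in the platform's order
if data[:2] == b"\xfe\xff":
    print("UTF-16BE")
elif data[:2] == b"\xff\xfe":
    print("UTF-16LE")
print("Example".encode("utf-16-be").hex(" "))   # 00 45 00 78 ... (no BOM, explicit BE)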
Ferdinand's answer (and others) is correct, but incomplete.
Big Endian (BE) / Little Endian (LE) have nothing to do with UTF-16 or UTF-32.
They existed way before Unicode, and affect how the bytes of numbers get stored in the computer's memory. They depend on the processor.
If you have a number with the value 0x12345678 then in memory it will be represented as 12 34 56 78 (BE) or 78 56 34 12 (LE).
UTF-16 and UTF-32 happen to be represented in 2 and 4 bytes respectively, so the order of the bytes follows the ordering that any number follows on that platform.
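For illustration, the 0x12345678 example above spelled out with Python's struct module:

import struct

print(struct.pack(">I", 0x12345678).hex(" "))   # 12 34 56 78  (big-endian)
print(struct.pack("<I", 0x12345678).hex(" "))   # 78 56 34 12  (little-endian)
print(struct.pack("=I", 0x12345678).hex(" "))   # native order of the machine running this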
UTF-16 encodes Unicode into 16-bit values. Most modern filesystems operate on 8-bit bytes. So, to save a UTF-16 encoded file to disk, for example, you have to decide which part of the 16-bit value goes in the first byte, and which goes into the second byte.
Wikipedia has a more complete explanation.
little-endian: adj.
Describes a computer architecture in which, within a given 16- or 32-bit word, bytes at lower addresses have lower significance (the word is stored ‘little-end-first’). The PDP-11 and VAX families of computers and Intel microprocessors and a lot of communications and networking hardware are little-endian. The term is sometimes used to describe the ordering of units other than bytes; most often, bits within a byte.
big-endian: adj.
[common; From Swift's Gulliver's Travels via the famous paper On Holy Wars and a Plea for Peace by Danny Cohen, USC/ISI IEN 137, dated April 1, 1980]
Describes a computer architecture in which, within a given multi-byte numeric representation, the most significant byte has the lowest address (the word is stored ‘big-end-first’). Most processors, including the IBM 370 family, the PDP-10, the Motorola microprocessor families, and most of the various RISC designs are big-endian. Big-endian byte order is also sometimes called network order.
---from the Jargon File: http://catb.org/~esr/jargon/html/index.html
Byte endianness (big or little) needs to be specified for Unicode/UTF-16 encoding because, for character codes that use more than a single byte, there is a choice of whether to read/write the most significant byte first or last. Unicode/UTF-16, since they are variable-length encodings (i.e. each char can be represented by one or several bytes), require this to be specified. (Note however that UTF-8 "words" are always 8 bits/one byte in length [though characters can span multiple bytes], therefore there is no problem with endianness.) If the encoder of a stream of bytes representing Unicode text and the decoder aren't agreed on which convention is being used, the wrong character code can be interpreted. For this reason, either the convention of endianness is known beforehand or, more commonly, a byte order mark is specified at the beginning of any Unicode text file/stream to indicate whether big or little endian order is being used.