Scala - How can I read some specific bytes from a file?

I'd like to encrypt a text (about 1 MB) and I use the maximum length of RSA keys (4096 bits). However, the key seems too short. As I googled, I got to know that the max size of text that RSA can encrypt is 8 bytes shorter than the length of the key. Thus, I can only encrypt 501 bytes in this way. So I decided to divide my text into 2093 arrays (1024*1024/501 = 2092.1). The question is: how can I pour the first 501 bytes into the first array in Scala? Can anyone help me out?

I can't comment on whether your cryptographic approach is okay. (I don't know, but would rely on libraries written and vetted by more knowledgeable cryptographers if I were in your shoes. I'm not sure why you chose 501, which is 11 bytes, not 8, shorter than 512.)
But chunking your arrays into fixed-size blocks should be easy. Just use the grouped function of Array.
val text : String = ???
val bytes = text.getBytes( scala.io.Codec.UTF8.charSet ) // lots of ways to do this
val blocks = bytes.grouped( 501 )
Here, blocks will be an Iterator[Array[Byte]], each array 501 bytes long except for the last (which may be shorter).
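If the question is how to get those bytes out of a file in the first place, here is a minimal sketch (assuming the whole file fits in memory, which is fine for a 1 MB input; the file name is made up):

import java.nio.file.{Files, Paths}

// Read the entire file into memory as one byte array.
val allBytes: Array[Byte] = Files.readAllBytes(Paths.get("mytext.txt"))

// Chunk into 501-byte blocks; only the last block may be shorter.
val chunks: Iterator[Array[Byte]] = allBytes.grouped(501)

// "Pour the first 501 bytes into the first array":
val firstBlock: Array[Byte] = chunks.next()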

Related

How to get the size of DER structure?

I need to learn the length of a DER structure.
It's used as a header for the ciphertext file. My cipher writes the DER-encoded data and the ciphertext back to back to the (ciphertext) file.
I need to learn the size of the DER structure so I can skip past it and only get the ciphertext from the file for decoding. I know I need to parse the length byte (or bytes) of the header's outer ASN.1 sequence to get that info, but I don't know how to do it since I am not sure how many bytes it takes to store that length data.
I put the DER header down below to give a basic idea. I would appreciate it if you could take a look at it.
Header = \
    asn1_sequence(
        asn1_sequence(
            asn1_octetstring(salt) +
            asn1_integer(iter) +
            asn1_integer(len(key))
        ) +
        asn1_sequence(
            asn1_objectidentifier([2,16,840,1,101,3,4,1,2]) +
            asn1_octetstring(iv_current)
        ) +
        asn1_sequence(
            asn1_sequence(
                asn1_objectidentifier([2,16,840,1,101,3,4,2,1]) +
                asn1_null()
            ) +
            asn1_octetstring(digestinfo)
        )
    )
Your data is encoded with DER, which means it uses the TLV (Tag-Length-Value) form.
To know the length of the Value, you have to read the Tag and the Length.
Reading the Tag and the Length is not trivial; it is explained in Recommendation X.690:
8.1.2 Identifier octets
8.1.3 Length octets
Depending on which language you want to use, you should be able to find some code to use or hack.
Example in Java
Read a tag
Read a length
This lib would be used like this:
BERReader reader = new BERReader(input);
reader.readTag();
reader.readLength();
reader.getLengthValue(); // gives you how many bytes you have to read
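If you would rather decode the length field by hand, here is a sketch in Scala (the language of the first question on this page) of the rules from X.690 8.1.3. readDerLength is a hypothetical helper, and it assumes the stream is positioned just after a single-octet tag:

import java.io.DataInputStream

// Parse a DER length field per X.690 8.1.3 and return the number of
// content (Value) bytes that follow.
def readDerLength(in: DataInputStream): Long = {
  val first = in.readUnsignedByte()
  if ((first & 0x80) == 0) {
    // Short form: bits 7..1 hold the length itself (0..127).
    first.toLong
  } else {
    // Long form: bits 7..1 give the number of subsequent length octets,
    // which hold the length as a big-endian unsigned integer.
    val numOctets = first & 0x7f
    var length = 0L
    for (_ <- 0 until numOctets)
      length = (length << 8) | in.readUnsignedByte()
    length
  }
}

The total header size is then 1 (the tag octet) + the number of length octets consumed + the Value length, so you can seek past exactly that many bytes to reach the ciphertext.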

How to truncate a 2's complement output

I have data written into a short data type. The data written is in 2's complement form.
Now, when I try to print the data using %04x, data with MSB=0 is printed fine: e.g. if data=740, the print I get is 0740.
But when MSB=1, I am unable to get a proper print: e.g. if data=842, the print I get is fffff842.
I want the data truncated to 4 bytes, so the expected output is f842.
Either declare your data as a type which is 16 bits long, or make sure the printing function uses the right format for a 16-bit value. Or use your current type, but do a bitwise AND with 0xffff. What you can do depends on the language you're using, really.
But whichever way you go, check your assumptions again. There seem to be a few issues in your question:
2's complement applies to signed numbers only. There are no negative numbers in your question.
Assuming you mean C's short - it doesn't have to be 16 bits long.
"I get is fffff842 I want the data truncated to 4 bytes" - fffff842 is 4 bytes long. f842 is 2 bytes long.
The 2-byte value 842 does not have the MSB set.
I'm assuming C (or possibly C++) as the language here.
Because of the default argument promotions involved when calling a variable argument function (such as printf), your use of a short will result in an integer promotion, which states that "If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int".
A short is converted to an int by means of sign-extension, and 0xf842 sign-extended to 32 bits is 0xfffff842.
You can use a bitwise AND to mask off the most significant word:
printf("%04x", data & 0xffff);
You could also add the h length specifier to state that you only want to print an (unsigned) short's worth of bits from an int:
printf("%04hx", data);

How can I convert the tiger hash values from the official implementations into the form used by Direct Connect?

I am trying to implement a Direct Connect Client, and I am currently stuck at a point where I need to hash the files in order to be able to upload them to other clients.
As all the other clients require TTHL (Tiger Tree Hashing Leaves) support for verification of the downloaded data, I have searched for implementations of the algorithm and found tiger-hash-python.
I have implemented a routine that uses the hash function from before, and is able to hash large files, according to the logic specified in Tree Hash EXchange format (THEX) (basically, the tree diagram is the important part on that page).
However, the value produced by it is similar to those shown on Wikipedia, a hex digest, but is different from those shown in the DC clients I'm using for reference.
I have been unable to find out how the hex digest form is converted to this other one (39 characters, A-Z, 0-9). Could someone please explain how that is done?
Well ... I tried what Paulo Ebermann said, using the following functions:
import base64
import math
import tiger  # the tiger-hash-python script mentioned above

def strdivide(lst, length):
    result = []
    # Calculate how many blocks there are, using the condition: i*length = len(lst).
    # The additional maths operations deal with the last block, which might be smaller.
    for i in range(0, int(math.ceil(float(len(lst)) / length))):
        result.append(lst[i*length:(i+1)*length])
    return result

def dchash(data):
    result = tiger.hash(data)  # From the aforementioned tiger-hash-python script; 48-char hex digest
    # Representation transform: reverse the byte order within each 8-byte word
    result = "".join(["".join(strdivide(result[i:i+16], 2)[::-1]) for i in range(0, 48, 16)])
    bits = "".join([chr(int(c, 16)) for c in strdivide(result, 2)])  # every 2 hex chars -> 1 raw byte
    result = base64.b32encode(bits)  # result will be 40 characters
    return result[:-1]  # leave behind the trailing '='
The TTH for an empty file was found to be 8B630E030AD09E5D0E90FB246A3A75DBB6256C3EE7B8635A, which after the transformation specified here, becomes 5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6. Base-32 encoding this result yielded LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ, which was found to be what DC++ generates.
The only mention of the format of the hash in the Direct Connect protocol I found is on the $SR page on the NMDC Protocol wiki:
For files containing TTH, the <hub_name> parameter is replaced with TTH:<base32_encoded_tth_hash> (ref: TTH_Hash).
So, it is Base32-encoding. This is defined in RFC 4648 (and some earlier ones), section 6.
Basically, you are using the capital letters A-Z and the decimal digits 2 to 7; one base32 digit represents 5 bits, while one base16 (hexadecimal) digit represents only 4.
This means each 5 hex digits map to 4 base32 digits, and for a Tiger hash (192 bits) you will need 40 base32 digits (in the official encoding, the last one would be '=' padding, which seems to be omitted given that you say there are always 39 characters).
I'm not sure of an implementation of a conversion from hex (or bytes) to base32, but it shouldn't be too complicated with a lookup table and some bit-shifting.
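On the JVM you can avoid the bit-shifting entirely; a sketch in Scala, assuming Apache Commons Codec is on the classpath (a recent version, where Hex.decodeHex accepts a String; the JDK ships Base64 but not Base32):

import org.apache.commons.codec.binary.{Base32, Hex}

// Convert a 48-char hex Tiger digest into the 39-char form DC clients show.
def hexToDcBase32(hexDigest: String): String = {
  val raw = Hex.decodeHex(hexDigest)         // 48 hex chars -> 24 raw bytes
  val b32 = new Base32().encodeToString(raw) // 24 bytes -> 40 chars, last one '='
  b32.stripSuffix("=")                       // DC clients omit the padding
}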

Why does a base64 encoded string have an = sign at the end

I know what base64 encoding is and how to calculate base64 encoding in C#; however, I have seen several times that when I convert a string into base64, there is an = at the end.
A few questions came up:
Does a base64 string always end with =?
Why does an = get appended at the end?
Q: Does a base64 string always end with =?
A: No. (The word "usb" is base64-encoded into dXNi.)
Q: Why does an = get appended at the end?
A: As a short answer:
The last character (= sign) is added only as a complement (padding) in the final process of encoding a message whose length is not a multiple of 3.
You will not have an = sign if your string's length is a multiple of 3 characters, because Base64 encoding takes each group of three bytes (a character = 1 byte) and represents it as four printable characters in the ASCII standard.
Example:
(a) If you want to encode
ABCDEFG <=> [ABC] [DEF] [G]
Base64 deals with the first block (producing 4 characters) and the second (as they are complete). But for the third, it will add a double == in the output in order to complete the 4 needed characters. Thus, the result will be QUJD REVG Rw== (without spaces).
[ABC] => QUJD
[DEF] => REVG
[G] => Rw==
(b) If you want to encode ABCDEFGH <=> [ABC] [DEF] [GH]
similarly, it will add one = at the end of the output to get 4 characters.
The result will be QUJD REVG R0g= (without spaces).
[ABC] => QUJD
[DEF] => REVG
[GH] => R0g=
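You can verify the worked examples above with the JDK's encoder; a quick check in Scala:

val enc = java.util.Base64.getEncoder
println(enc.encodeToString("ABCDEFG".getBytes("US-ASCII")))   // QUJDREVGRw==
println(enc.encodeToString("ABCDEFGH".getBytes("US-ASCII")))  // QUJDREVGR0g=
println(enc.encodeToString("ABCDEFGHI".getBytes("US-ASCII"))) // QUJDREVGR0hJ (9 bytes is a multiple of 3, so no padding)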
It serves as padding.
A more complete answer is that a base64-encoded string doesn't always end with =; it will only end with one or two = signs if they are required to pad the string out to the proper length.
From Wikipedia:
The final '==' sequence indicates that the last group contained only one byte, and '=' indicates that it contained two bytes.
Thus, this is some sort of padding.
It's defined in RFC 2045 as a special padding character, used when fewer than 24 bits are available at the end of the data being encoded.
No.
To pad the Base64-encoded string to a multiple of 4 characters in length, so that it can be decoded correctly.
The equals sign (=) is used as padding in certain forms of base64 encoding. The Wikipedia article on base64 has all the details.
It's padding. From http://en.wikipedia.org/wiki/Base64:
In theory, the padding character is not needed for decoding, since the number of missing bytes can be calculated from the number of Base64 digits. In some implementations, the padding character is mandatory, while for others it is not used. One case in which padding characters are required is concatenating multiple Base64 encoded files.
http://www.hcidata.info/base64.htm
Encoding "Mary had" to Base 64
In this example we are using a simple text string ("Mary had") but the principle holds no matter what the data is (e.g. graphics file). To convert each 24 bits of input data to 32 bits of output, Base 64 encoding splits the 24 bits into 4 chunks of 6 bits. The first problem we notice is that "Mary had" is not a multiple of 3 bytes - it is 8 bytes long. Because of this, the last group of bits is only 4 bits long. To remedy this we add two extra bits of '0' and remember this fact by putting a '=' at the end. If the text string to be converted to Base 64 was 7 bytes long, the last group would have had 2 bits. In this case we would have added four extra bits of '0' and remember this fact by putting '==' at the end.
= is a padding character. If the input stream has a length that is not a multiple of 3, the padding character will be added. This is required by the decoder: if no padding were present, the last byte would be decoded with an incorrect number of zero bits.
Better and deeper explanation here: https://base64tool.com/detect-whether-provided-string-is-base64-or-not/
The equals or double equals serves as padding. It's a stupid concept defined in RFC 2045 and it is actually superfluous. Any decent parser can encode and decode a base64 string without knowing about padding by just counting up the number of characters and filling in the rest if the size isn't divisible by 3 or 4, respectively. This actually leads to difficulties every now and then, because some parsers expect padding while others blatantly ignore it. My MPU base64 decoder, for example, needs padding, but it receives a non-padded base64 string over the network. This leads to erroneous parsing, and I had to account for it myself.

MD5/SHA "update" property?

What is the MD5/SHA property that allows you to "update" them? For example, if you have the hash for "test" you can add "case" to get the hash for "testcase". I would like to read up on this property a bit but my searches turn up nothing...
It is merely that they are actually calculated incrementally -- you calculate them by operating on the first n bytes of data (64 bytes, i.e. one 512-bit block, in the case of MD5; see http://en.wikipedia.org/wiki/MD5#Algorithm), then on the next n bytes of data, etc.
EDIT: This isn't even theoretically possible, due to the 1-bit padding I mention below. In effect, md5("case", seed=md5("test")) == md5("test" + <1-bit> + "case"). There is no way to use md5("test") to incrementally compute md5("test" + "case").
This is theoretically possible if you concatenate 512-bit chunks. It won't work for appending "case" to "test", because the first run of the state machine is polluted by the padding used to turn "test" into a 512-bit chunk.
Additionally, the padding isn't just a bunch of zeros. The message is always first padded with a 1 bit, so that "case" and "case\0" produce different hashes. Thus you can't rely on "case" having the same hash with or without padding.
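What does work is a streaming API that keeps the not-yet-finalized internal state between calls; a sketch in Scala using the JDK's MessageDigest:

import java.security.MessageDigest

// Feed the data in pieces; the digest object keeps its internal state.
val md = MessageDigest.getInstance("MD5")
md.update("test".getBytes("UTF-8"))
md.update("case".getBytes("UTF-8"))
val incremental = md.digest()

// Identical to hashing the concatenation in one go, because the padding
// is applied only once, at digest() time.
val oneShot = MessageDigest.getInstance("MD5").digest("testcase".getBytes("UTF-8"))
assert(incremental.sameElements(oneShot))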
The MD5 algorithm has the following steps:
1) pad input string to a multiple of 64 bytes
2) split input string into blocks of 64 bytes
3) initialise state (a 4-element array)
4) for each block: state <= transform(state,block)
5) encode state as string
To support situations where you want to hash something in stages (e.g. large files), this can be refactored as follows.
Initialise:
1) initialise state
2) leftover bytes <= ""
Update:
1) append leftover bytes to start of input string
2) split input string into blocks of 64 bytes
3) for each complete block: state <= transform(state,block)
4) leftover bytes <= contents of the incomplete block, if one exists
Digest:
1) pad a copy of the leftover bytes
2) split the padded leftover bytes into blocks of 64 bytes
3) tmp_state <= state
4) for each block: tmp_state <= transform(tmp_state,block)
5) encode tmp_state as string
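For illustration, the same refactoring as a Scala skeleton; transform, the padding rule, and the state encoding are left as stubs (???) because they are the algorithm-specific steps listed above, not part of the incremental bookkeeping being shown:

class IncrementalHash(blockSize: Int = 64) {
  private var state: Array[Int] = initState()
  private var leftover: Array[Byte] = Array.emptyByteArray

  def update(input: Array[Byte]): Unit = {
    val data = leftover ++ input // append leftover bytes to the start of the input
    val full = data.length / blockSize
    for (i <- 0 until full) // transform each complete block
      state = transform(state, data.slice(i * blockSize, (i + 1) * blockSize))
    leftover = data.drop(full * blockSize) // keep the incomplete tail, if any
  }

  def digest(): Array[Byte] = {
    var tmp = state // work on a copy, so update() can still be called afterwards
    for (block <- pad(leftover).grouped(blockSize))
      tmp = transform(tmp, block)
    encode(tmp)
  }

  private def initState(): Array[Int] = ???                                  // the four initial state words
  private def transform(s: Array[Int], block: Array[Byte]): Array[Int] = ??? // the compression function
  private def pad(tail: Array[Byte]): Array[Byte] = ???                      // 1-bit, zeros, 64-bit length
  private def encode(s: Array[Int]): Array[Byte] = ???                       // serialise the state
}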
I've actually implemented this approach in VBA - it seems to work fine. Any suggestions for where I should upload the code?