Kdb+ data format when writing to a file - kdb

I'm trying to understand what happens when someone tries to write to a file in the following way:
q)h:hopen `:out
q)h (1 2;3)
3i
q)hclose h
q)read1 `:out
0x07000200000001000000000000000200000000000000f90300000000000000
This is not the same as the binary representation produced by -8!:
q)-8!(1 2;3)
0x010000002d00000000000200000007000200000001000000000000000200000000000000f90300000000000000
In which data format was (1 2;3) written into `:out?
Is there a way to read the data back from it, like -9!-8!(1 2;3)?
How are set/get related to these binary data formats? Do they use a third, different binary format?

Technically you can read the object back if you have a little prior knowledge about it, enough to fabricate a header:
q)read1`:out
0x07000200000001000000000000000200000000000000f90300000000000000
q)-9!read1`:out
'badmsg
[0] -9!read1`:out
^
q)-9!{0x01,0x000000,(reverse 0x0 vs `int$count[x]+1+3+4+1+1+4),0x00,0x00,(reverse 0x0 vs 2i),x}read1`:out
1 2
3
The header here is comprised of:
0x01 - little endian
0x000000 - filler
message length - the count of the raw x plus the header additions
0x00 - type (generic list = type 0) ... you have to know this in advance
0x00 - attributes ... none here, you would have to know this
length of list ... here we knew it was a 2-item list
x - the raw bytes of the object, without any header
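For comparison, here is a minimal Python sketch of the same header fabrication (the helper name wrap_ipc and the little-endian/async assumptions are mine); applied to the raw bytes read from `:out it reproduces the output of -8!(1 2;3):
import struct

def wrap_ipc(raw, n_items):
    # Prepend a kdb+ IPC header plus a generic-list (type 0) wrapper to the
    # raw item bytes, so the result can be deserialised with -9!.
    body = (b"\x00"                                # type: generic list (0)
            + b"\x00"                              # attributes: none
            + struct.pack("<i", n_items)           # item count, little-endian
            + raw)                                 # raw item bytes from the file
    header = (b"\x01"                              # little-endian marker
              + b"\x00\x00\x00"                    # message type (async) + filler
              + struct.pack("<i", 8 + len(body)))  # total message length
    return header + body

raw = bytes.fromhex("0700020000000100000000000000"
                    "0200000000000000f90300000000000000")
print(wrap_ipc(raw, 2).hex())  # same bytes as -8!(1 2;3)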
As rianoc pointed out, there are better ways to write such objects so that they can be read back more easily, without requiring this prior knowledge.

-8! returns the IPC byte representation.
On disk the binary format is different. To read this data use get.
To stream data to a log file you need to create the empty file first:
https://code.kx.com/q/kb/replay-log/#replaying-log-files
q)`:out set () /This important part adds a header to the file to set the type
`:out
q)h:hopen `:out
q)h (1 2;3)
600i
q)hclose h
q)get `:out
1 2
3
Note: if you wish to write your item as a single element of the list, use enlist
q)`:out set ()
`:out
q)h:hopen `:out
q)h enlist (1 2;3)
616i
q)hclose h
q)get `:out
1 2 3
Binary and text data can also be written to a file, which is what you are doing.
https://code.kx.com/q/ref/hopen/#files
The intention is to write specific pieces of data:
q)h 0x2324 /Write some binary
q)h "some text\n" /Write some text
In your code the binary representation of the raw KX object does get written, but no header is added (it is neither IPC nor disk format), so when you read the data back it cannot be interpreted correctly by either -9! or get.
Valid file binary created with `:out set () has a file header:
(Can be read by get)
q)read1 `:out
0xff0100000000000000000200000007000200000001000000000000000200000000000000f90300000000000000
Valid IPC binary with IPC header:
(Can be read by -9!)
q)-8!(1 2;3)
0x010000002d00000000000200000007000200000001000000000000000200000000000000f90300000000000000
Your raw object in binary - no header present to enable interpretation:
q)read1 `:out
0x07000200000001000000000000000200000000000000f90300000000000000

Related

How to get the size of DER structure?

I need to learn the length of a DER structure.
It's used as a header for the ciphertext file. My cipher writes the DER-encoded data and the ciphertext back to back into the (cipher text) file.
I need to learn the size of the DER structure so I can skip it and get only the ciphertext from the file for decoding. I know I need to parse the length byte (or bytes) of the header's outer ASN.1 sequence to get that info, but I don't know how to do it since I am not sure how many bytes it takes to store that length data.
I put the DER header down below to give a basic idea. I would appreciate it if you could take a look at it.
Header = \
    asn1_sequence(
        asn1_sequence(
            asn1_octetstring(salt) +
            asn1_integer(iter) +
            asn1_integer(len(key))
        ) +
        asn1_sequence(
            asn1_objectidentifier([2,16,840,1,101,3,4,1,2]) +
            asn1_octetstring(iv_current)
        ) +
        asn1_sequence(
            asn1_sequence(
                asn1_objectidentifier([2,16,840,1,101,3,4,2,1]) +
                asn1_null()
            ) +
            asn1_octetstring(digestinfo)
        )
    )
Your data is encoded with DER, which means it uses the TLV (Tag-Length-Value) form.
To know the length of the Value, you have to read the Tag and the Length.
Reading the Tag and the Length is not trivial; it is explained in Recommendation X.690:
8.1.2 Identifier octets
8.1.3 Length octets
Depending on which language you want to use, you should be able to find some code to use or hack.
Example in Java
Read a tag
Read a length
This lib would be used like this:
BERReader reader = new BERReader(input);
reader.readTag();
reader.readLength();
reader.getLengthValue(); // gives you how many bytes you have to read
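If you prefer to do it by hand, here is a minimal Python sketch (the function name is mine, and it assumes the outer tag fits in a single identifier octet, which is the case for the SEQUENCE here) that follows X.690 8.1.2/8.1.3 to find where the DER header ends and the ciphertext begins:
def der_header_size(data):
    # Return the total size in bytes of the outermost DER TLV in data.
    pos = 1                      # skip the identifier octet (0x30 for SEQUENCE)
    first = data[pos]
    pos += 1
    if first < 0x80:             # short form: this octet is the Value length
        length = first
    else:                        # long form: low 7 bits = number of length octets
        n = first & 0x7F
        length = int.from_bytes(data[pos:pos + n], "big")
        pos += n
    return pos + length          # identifier octet + length octets + Value

# Usage: skip the DER header and keep only the ciphertext.
# with open("cipher.bin", "rb") as f:
#     blob = f.read()
# ciphertext = blob[der_header_size(blob):]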

Why does Matlab dbf-reader read certain integers wrong?

I use the Matlab dbf reader available here.
I've noticed that three-digit integers are sometimes read incorrectly.
Original data from dbf-file:
LAMAX,DTLD,1,599,727Q9,A,STANDARD,1,18,18,0,2359.5
But looking at the data in Matlab you see that 599 becomes 995.
Why is that?
'LAMAX','DTLD',[1],[995],'727Q9','A','STANDARD','1','18','18','0',
[2.3595e+03]
This is how I read the dbf file with matlab code
[dbfData, NAMES] = dbfread(path2file);
where dbfData is the actual data and NAMES are the field names in the dbf-file.
EDIT:
The dbf-file was created with INM.
When I open the dbf file using OpenOffice the headers look like this:
METRIC_ID,C,6 ; GRID_ID,C,8 ; I_INDEX,N,3,0 ; J_INDEX,N,3,0 ; ACFT_ID,C,12 ; OP_TYPE,C,1 ; PROF_ID1,C,8 ; PROF_ID2,C,1 ; RWY_ID,C,8 ; TRK_ID1,C,8 ; TRK_ID2,C,1 ; DISTANCE,N,9,1
The distorted integers are stored as 3-digit numbers without decimals: J_INDEX,N,3,0
Have you used the updated version of STR2DOUBLE2CELL?
From the link above:
STR2DOUBLE2CELL subfunction sometimes works incorrectly if number of digits in the input parameter is different

How to truncate a 2's complement output

I have data written into a short data type. The data written is in 2's complement form.
Now when I try to print the data using %04x, data with MSB=0 is printed fine: for example, if data=740, the print I get is 0740.
But when the MSB=1, I am unable to get a proper print. For example, if data=842, the print I get is fffff842.
I want the data truncated to 4 bytes, so the expected output is f842.
Either declare your data as a type which is 16 bits long, or make sure the printing function uses the right format for a 16-bit value. Or keep your current type, but do a bitwise AND with 0xffff. What you can do depends on the language you're using, really.
But whichever way you go, check your assumptions again. There seem to be a few issues in your question:
2s-complement applies to signed numbers only. There are no negative numbers in your question.
Assuming you mean C's short - it doesn't have to be 16 bits long.
"I get is fffff842 I want the data truncated to 4 bytes" - fffff842 is 4 bytes long. f842 is 2 bytes long.
A 2-byte-long value of 842 does not have the MSB set.
I'm assuming C (or possibly C++) as the language here.
Because of the default argument promotions involved when calling a variable-argument function (such as printf), your use of a short results in an integer promotion, which states that "If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int".
A short is converted to an int by means of sign extension, and 0xf842 sign-extended to 32 bits is 0xfffff842.
You can use a bitwise AND to mask off the most significant word:
printf("%04x", data & 0xffff);
You could also add the h length specifier to state that you only want to print an (unsigned) short's worth of bits from an int:
printf("%04hx", data);

The torrent info_hash parameter

How does one calculate the info_hash parameter? That is, the hash corresponding to the info dictionary?
From official specs:
info_hash
The 20 byte sha1 hash of the bencoded form of the info value from the metainfo file. Note that this is a substring of the metainfo file.
This value will almost certainly have to be escaped.
Does this mean simply getting the substring from the metainfo file and doing a SHA-1 hash on those bytes?
... because that is how I tried it, 12 times, but without success: I compared the resulting hash with the one I should end up with and they differ, and on top of that the tracker response is "FAILURE, unknown torrent" or something like that.
So how do you calculate the info_hash?
The metainfo file is already bencoded, so I don't understand why you would encode it again?
I finally got this working in Java code, here is my code:
byte metaData[]; //the raw .torrent file
int infoIdx = ?; //index of 'd' right after the "4:info" string
info_hash = SHAsum(Arrays.copyOfRange(metaData, infoIdx, metaData.length-1));
This assumes the 'info' block is the last block in the torrent file (wrong?)
Don't sort or anything like that, just use a substring of the raw torrent file.
Works for me.
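For reference, here is a rough Python equivalent of the snippet above (the function name is mine; like the Java version it naively assumes the first occurrence of 4:info marks the info key and that the info dictionary is the last entry in the file):
import hashlib

def info_hash(path):
    with open(path, "rb") as f:
        meta = f.read()                              # raw, already-bencoded .torrent
    start = meta.index(b"4:info") + len(b"4:info")   # 'd' opening the info dict
    info = meta[start:-1]                            # drop only the outer dict's final 'e'
    return hashlib.sha1(info).digest()               # 20 raw bytes; hex/URL-escape as needed

# print(info_hash("example.torrent").hex())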
bdecode the metafile. Then it's simply sha1(bencode(metadata['info']))
(i.e. bencode only the info dict again, then hash that).

How can I convert the tiger hash values from the official implementations into the form used by Direct Connect?

I am trying to implement a Direct Connect client, and I am currently stuck at a point where I need to hash the files in order to be able to upload them to other clients.
All the other clients require TTHL (Tiger Tree Hashing Leaves) support for verification of the downloaded data, so I have searched for implementations of the algorithm and found tiger-hash-python.
I have implemented a routine that uses the hash function from before and is able to hash large files, according to the logic specified in Tree Hash EXchange format (THEX) (basically, the tree diagram is the important part on that page).
However, the value produced by it is similar to those shown on Wikipedia, a hex digest, but is different from those shown in the DC clients I'm using for reference.
I have been unable to find out how the hex digest form is converted to this other one (39 characters, A-Z, 0-9). Could someone please explain how that is done?
Well ... I tried what Paulo Ebermann said, using the following functions:
import base64
import math
import tiger # tiger-hash-python

def strdivide(list, length):
    result = []
    # Calculate how many blocks there are, using the condition: i*length = len(list).
    # The additional maths operations are to deal with the last block which might have a smaller size
    for i in range(0, int(math.ceil(float(len(list))/length))):
        result.append(list[i*length:(i+1)*length])
    return result

def dchash(data):
    result = tiger.hash(data) # From the aforementioned tiger-hash-python script, 48-char hex digest
    result = "".join(["".join(strdivide(result[i:i+16], 2)[::-1]) for i in range(0, 48, 16)]) # Representation transform
    bits = "".join([chr(int(c, 16)) for c in strdivide(result, 2)]) # Converting every 2 hex characters into 1 character
    result = base64.b32encode(bits) # Result will be 40 characters
    return result[:-1] # Leaving behind the trailing '='
The TTH for an empty file was found to be 8B630E030AD09E5D0E90FB246A3A75DBB6256C3EE7B8635A, which after the transformation specified here, becomes 5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6. Base-32 encoding this result yielded LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ, which was found to be what DC++ generates.
The only mention of the format of the hash in the Direct Connect protocol I found is on the $SR page on the NMDC Protocol wiki:
For files containing TTH, the <hub_name> parameter is replaced with TTH:<base32_encoded_tth_hash> (ref: TTH_Hash).
So, it is Base32 encoding. This is defined in RFC 4648 (and some earlier ones), section 6.
Basically, you are using the capital letters A-Z and the decimal digits 2 to 7, and one base32 digit represents 5 bits, while one base16 (hexadecimal) digit represents only 4.
This means each 5 hex digits map to 4 base32 digits, and for a Tiger hash (192 bits) you need 39 base32 digits of data (the official encoding pads this to 40 characters, the last one being an = padding character, which seems to be omitted if you say that there are always 39 characters).
I'm not sure of an implementation of a conversion from hex (or bytes) to base32, but it shouldn't be too complicated with a lookup table and some bit-shifting.
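For example, a minimal sketch of such a conversion (Python; the function name is mine, and it assumes the unpadded RFC 4648 alphabet that DC clients use), built from exactly that lookup table and bit-shifting:
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"

def bytes_to_base32(data):
    bits = 0        # bit accumulator
    nbits = 0       # number of bits currently in the accumulator
    out = []
    for byte in bytearray(data):
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5:                        # emit one digit per 5 bits
            nbits -= 5
            out.append(ALPHABET[(bits >> nbits) & 0x1F])
    if nbits:                                    # final partial group, zero-padded on the right
        out.append(ALPHABET[(bits << (5 - nbits)) & 0x1F])
    return "".join(out)                          # no '=' padding, as DC expects

# 24-byte (192-bit) Tiger digest -> 39 characters; should match the DC++ value quoted above:
# bytes_to_base32(bytes.fromhex("5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6"))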