How do I check if a file is *mostly* identical with another? - powershell

I need to use Powershell to check if two files are the same but with the following restriction: there are eight specific bytes in the first 2K that are allowed to be different (if you're interested, it's certain timestamp bytes in the superblock of an ext4 image).
The code I found on Stack Overflow (obviously) for doing full checks is as follows:
$md5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$hash = [System.BitConverter]::ToString(
$md5.ComputeHash([System.IO.File]::ReadAllBytes("fspec.bin")))
This gives me the hash of the entire file but what I really need is:
the first 2K of the file as a byte array so I can check specifics; and
the checksum of the remainder of the file to check equality.
The System.IO.File class has ReadAllBytes but does not appear to have the capacity to read a section of the file, nor seek to a specific place.
I have attempted to read in the byte array and use array slicing to get the parts as follows:
$restOfFile = [System.IO.File]::ReadAllBytes("fspec")
$firstTwoK = $restOfFile[0..2048]
$restOfFile = $restOfFile[2048..$restOfFile.Length]
# Then:
# 1. Check bytes in firstTwoK.
# 2. Check MD5 of all bytes in restOfFile.
Unfortunately, the fact that it's a 750M file is causing problems:
Array dimensions exceeded supported range.
At C:\testprog\testprog.ps1:42 char:1
+ ${devBytes} = ${devBytes}[2048..${devBytes}.Length]
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: (:) [], OutOfMemoryException
+ FullyQualifiedErrorId : System.OutOfMemoryException
Is there a functional way to do what I need?

Use one of the derived types of System.Security.Cryptography.HashAlgorithm and use its ComputeHash method, which lets you specify an offset. For checking file uniqueness, MD5 is still fine to use, though you can use a stronger algorithm if you prefer:
$fileBytes = [System.IO.File]::ReadAllBytes("C:\path\to\file.ext")
$md5Cng = [System.Security.Cryptography.MD5Cng]::Create()
$fileHashAfterOffset = $md5Cng.ComputeHash( $fileBytes, 2KB, $fileBytes.length - 2KB )
The first argument of ComputeHash is the file as a Byte[]. The second argument is the offset (i.e., how many leading bytes to skip when generating the hash), and the third argument is how many bytes you want to evaluate. In this case, we want the rest of the file, so we take the total number of bytes in the $fileBytes array and subtract the offset from it.
Using 2KB is PowerShell shorthand for the number of bytes in 2 kilobytes (2048).
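Putting it together, a minimal sketch of the whole comparison might look like the following; the file paths and the $allowedOffsets positions are placeholders for your image paths and the superblock timestamp offsets:
$allowedOffsets = @(0x400, 0x404)   # hypothetical positions of the bytes allowed to differ
$bytesA = [System.IO.File]::ReadAllBytes("C:\path\to\imageA.bin")
$bytesB = [System.IO.File]::ReadAllBytes("C:\path\to\imageB.bin")

# Compare the first 2K byte by byte, ignoring the whitelisted positions.
$headerOk = $true
for ($i = 0; $i -lt 2KB; $i++) {
    if (($bytesA[$i] -ne $bytesB[$i]) -and ($allowedOffsets -notcontains $i)) {
        $headerOk = $false
        break
    }
}

# Hash everything after the first 2K and compare the digests.
$md5 = [System.Security.Cryptography.MD5]::Create()
$hashA = [System.BitConverter]::ToString($md5.ComputeHash($bytesA, 2KB, $bytesA.Length - 2KB))
$hashB = [System.BitConverter]::ToString($md5.ComputeHash($bytesB, 2KB, $bytesB.Length - 2KB))

$mostlyIdentical = $headerOk -and ($hashA -eq $hashB)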

Related

perl: utf8 <something> does not map to Unicode while <something> doesn't seem to be present

I'm using MARC::Lint to lint some MARC records, but every now and then I'm getting an error (on about 1% of the files):
utf8 "\xCA" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.
The problem is that I've tried different methods but cannot find "\xCA" in the file...
My script is:
#!perl -w
use MARC::File::USMARC;
use MARC::Lint;
use utf8;
use open OUT => ':utf8';
my $lint = new MARC::Lint;
my $filename = shift;
my $file = MARC::File::USMARC->in( $filename );
while ( my $marc = $file->next() ) {
    $lint->check_record( $marc );
    # Print the errors that were found
    print join( "\n", $lint->warnings ), "\n";
} # while
and the file can be downloaded here: http://eroux.fr/I14376.mrc
Is "\xCA" hidden somewhere? Or is this a bug in MARC::Lint?
The problem has nothing to do with MARC::Lint. Remove the lint check, and you'll still get the error.
The problem is a bad data file.
The file contains a "directory" of where the information is located in the file. The following is a human-readable rendition of the directory for the file you provided:
tagno|offset|len # Offsets are from the start of the data portion.
001|00000|0017 # Lengths include the single-byte field terminator.
006|00017|0019 # Offsets and lengths are in bytes.
007|00036|0015
008|00051|0041
035|00092|0021
035|00113|0021
040|00134|0018
050|00152|0022
066|00174|0009
245|00183|0101
246|00284|0135
264|00419|0086
300|00505|0034
336|00539|0026
337|00565|0026
338|00591|0036
546|00627|0016
500|00643|0112
505|00755|9999 <--
506|29349|0051
520|29400|0087
533|29487|0115
542|29602|0070
588|29672|0070
653|29742|0013
710|29755|0038
720|29793|0130
776|29923|0066
856|29989|0061
880|30050|0181
880|30231|0262
Notice the length of the field with tag 505: 9999. This is the maximum value supported (because the length is stored as four decimal digits). The catch is that the value of that field is far larger than 9,999 bytes; it's actually 28,594 bytes in size.
What happens is that the module extracts 9,999 bytes rather than 28,594. This happens to cut a UTF-8 sequence in half. (The specific sequence is CA BA, the encoding of ʺ.) Later, when the module attempts to decode that text, an error is thrown. (CA must be followed by another byte to be valid.)
Are these records you are generating? If so, you need to make sure that no field requires more than 9,999 bytes.
Still, the module should handle this better. When it finds no end-of-field marker where it expects one, it could read until it reaches one instead of trusting the stated length, and/or it could handle decoding errors in a non-fatal manner. It already has a mechanism to report these problems ($marc->warnings).
In fact, if it hadn't died (say if the cut happened to occur in between characters instead of in the middle of one), $marc->warnings would have returned the following message:
field does not end in end of field character in tag 505 in record 1
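If you want to spot such over-long fields up front, here is a rough sketch (in PowerShell, and assuming a single-record file like the one linked above) that walks the record's directory and flags any length of 9999:
# Read the raw record and parse its directory (12-byte entries: 3-char tag, 4-char length, 5-char offset).
$bytes = [System.IO.File]::ReadAllBytes("I14376.mrc")
$text  = [System.Text.Encoding]::ASCII.GetString($bytes)
$pos = 24                                  # the directory starts right after the 24-byte leader
while ($text[$pos] -ne [char]0x1E) {       # 0x1E (field terminator) ends the directory
    $tag = $text.Substring($pos, 3)
    $len = [int]$text.Substring($pos + 3, 4)
    $off = [int]$text.Substring($pos + 7, 5)
    if ($len -eq 9999) { Write-Warning "tag $tag claims 9999 bytes - the field is probably longer and will be truncated" }
    "{0}|{1:D5}|{2:D4}" -f $tag, $off, $len
    $pos += 12
}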

Powershell write value to serial port

How can I write the value 255 to the serial port in Powershell ?
$port= new-Object System.IO.Ports.SerialPort COM6,4800,None,8,one
$port.open()
$port.Write([char]255)
$port.Close()
The output of the previous script is 63 (viewed with a serial port monitor).
However $port.Write([char]127) gives 127 as result. If the value is higher than 127 the output is always 63.
Thanks in advance for your help !
Despite your attempt to use [char], your argument is treated as [string], because PowerShell chooses the following overload of the Write method, given that you're only passing a single argument:
void Write(string text)
The documentation for this particular overload states (emphasis added):
By default, SerialPort uses ASCIIEncoding to encode the characters. ASCIIEncoding encodes all characters greater than 127 as (char)63 or '?'. To support additional characters in that range, set Encoding to UTF8Encoding, UTF32Encoding, or UnicodeEncoding.
To send byte values, you must use the following overload:
void Write(byte[] buffer, int offset, int count)
That requires you to:
use a [byte[]] cast to convert your byte value(s); and
specify values for offset, the starting byte position, as well as count, the number of bytes to write starting from that position.
In your case:
$port.Write([byte[]] (255), 0, 1)
Note: The (...) around 255 isn't required for a single value, but would be necessary to specify multiple, ,-separated values.
Note:
If you want to send entire strings, and those strings include characters outside the ASCII range, you'll need to set the port's character encoding first, as shown in this answer, which also shows an alternative solution based on obtaining a byte-array representation of a string based on the desired encoding (which then allows you to use the same method overload as above).
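For instance, a minimal sketch of that string-based route, assuming the device actually expects UTF-8 text (the port name and the text are placeholders):
$port = New-Object System.IO.Ports.SerialPort COM6,4800,None,8,one
$port.Encoding = [System.Text.Encoding]::UTF8   # or UnicodeEncoding / UTF32Encoding, per the quote above
$port.Open()
$port.Write("température 25°C")   # hypothetical text; non-ASCII chars now go out as UTF-8 bytes instead of '?'
$port.Close()
Note that with UTF-8 a character such as [char]255 goes out as two bytes (0xC3 0xBF), so for raw byte values the Write([byte[]], int, int) overload above remains the direct route.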
Try something like this:
$port1 = new-Object System.IO.Ports.SerialPort COM1,4800,None,8,one
$port1.Open()
$data = [System.Text.Encoding]::UTF32.GetBytes([char]255)
$port1.Write( $data, 0, $data.Length )
$port1.ReadExisting()
$port1
$port1.Close()
Should work.

How should the 'System Use' field be interpreted in a 'Directory Record'?

In the ECMA 119 specifications (freely available here), I am trying to understand how to fetch the content of the System Use field:
How is one supposed to compute the length of the System Use field, i.e. how is the value of LEN_SU, shown in the left column, determined?
The value of LEN_SU is given implicitly. From BP1 you know the total number of bytes in the directory record (LEN_DR). LEN_SU is then given (implicitly) as the bytes remaining in the directory record after 33+LEN_FI+possible_padding(1), where you get length LEN_FI from BP33.
Hence
LEN_SU = LEN_DR - (33+LEN_FI+possible_padding(1))
From the spec.:
Padding Field [BP (34 + LEN_FI)]
This field shall be present in the Directory Record only if the number in the Length of the File Identifier field is an even number.
System Use [BP (LEN_DR - LEN_SU + 1) to LEN_DR]
This field shall be optional. If present, this field shall be reserved for system use. Its content is not specified by this Standard. If necessary, so that the Directory Record comprises an even number of bytes, a (00) byte shall be added to terminate this field.
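As a small sketch of that computation, assuming $record already holds the raw bytes of a single Directory Record:
$lenDr = $record[0]        # BP1: Length of Directory Record (LEN_DR)
$lenFi = $record[32]       # BP33: Length of File Identifier (LEN_FI)
$padding = if ($lenFi % 2 -eq 0) { 1 } else { 0 }   # the Padding Field is present only when LEN_FI is even
$lenSu = $lenDr - (33 + $lenFi + $padding)
if ($lenSu -gt 0) {
    $systemUse = $record[(33 + $lenFi + $padding)..($lenDr - 1)]   # the System Use bytes
}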

The torrent info_hash parameter

How does one calculate the info_hash parameter? That is, the hash corresponding to the info dictionary?
From official specs:
info_hash
The 20 byte sha1 hash of the bencoded form of the info value from the metainfo file. Note that this is a substring of the metainfo file.
This value will almost certainly have to be escaped.
Does this mean simply getting the substring from the meta-info file and doing a SHA-1 hash on the representative bytes?
... because that is what I have tried 12 times without success, meaning I have compared the resulting hash with the one I should end up with and they differ, plus the tracker response is "FAILURE, unknown torrent" or something like that.
So how do you calculate the info_hash?
The metafile is already bencoded so I don't understand why you encode it again?
I finally got this working in Java code, here is my code:
byte metaData[]; //the raw .torrent file
int infoIdx = ?; //index of 'd' right after the "4:info" string
info_hash = SHAsum(Arrays.copyOfRange(metaData, infoIdx, metaData.length-1));
This assumes the 'info' block is the last block in the torrent file (wrong?)
Don't sort or anything like that, just use a substring of the raw torrent file.
Works for me.
bdecode the metafile. Then it's simply sha1(bencode(metadata['info']))
(i.e. bencode only the info dict again, then hash that).
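For what it's worth, a rough PowerShell rendition of the first, substring-based approach (with the same caveat that the info dictionary must be the last entry in the file; $infoIdx is assumed to already be known):
$metaData = [System.IO.File]::ReadAllBytes("C:\path\to\file.torrent")
# $infoIdx is assumed to already point at the 'd' right after the "4:info" key;
# a real implementation would locate it by scanning the raw bytes.
$infoBytes = [byte[]] $metaData[$infoIdx..($metaData.Length - 2)]   # drop the final 'e' that closes the outer dict
$sha1 = [System.Security.Cryptography.SHA1]::Create()
$infoHash = [System.BitConverter]::ToString($sha1.ComputeHash($infoBytes)) -replace '-', ''
$infoHash   # 40 hex characters; the raw 20 bytes are what gets URL-escaped for the tracker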

How can I convert the tiger hash values from the official implementations into the form used by Direct Connect?

I am trying to implement a Direct Connect Client, and I am currently stuck at a point where I need to hash the files in order to be able to upload them to other clients.
As all the other clients require TTHL (Tiger Tree Hashing Leaves) support for verification of the downloaded data, I have searched for implementations of the algorithm and found tiger-hash-python.
I have implemented a routine that uses the hash function from before, and is able to hash large files, according to the logic specified in Tree Hash EXchange format (THEX) (basically, the tree diagram is the important part on that page).
However, the value produced by it is similar to those shown on Wikipedia, a hex digest, but is different from those shown in the DC clients I'm using for reference.
I have been unable to find out how the hex digest form is converted to this other one (39 characters, A-Z, 0-9). Could someone please explain how that is done?
Well ... I tried what Paulo Ebermann said, using the following functions:
import math
import base64
import tiger  # the tiger-hash-python script mentioned above

def strdivide(list, length):
    result = []
    # Calculate how many blocks there are, using the condition: i*length = len(list).
    # The additional maths operations are to deal with the last block which might have a smaller size
    for i in range(0, int(math.ceil(float(len(list))/length))):
        result.append(list[i*length:(i+1)*length])
    return result

def dchash(data):
    result = tiger.hash(data) # From the aforementioned tiger-hash-python script, 48-char hex digest
    result = "".join([ "".join(strdivide(result[i:i+16],2)[::-1]) for i in range(0,48,16) ]) # Representation transform
    bits = "".join([chr(int(c,16)) for c in strdivide(result,2)]) # Converting every 2 hex characters into 1 raw character
    result = base64.b32encode(bits) # Result will be 40 characters
    return result[:-1] # Leaving behind the trailing '='
The TTH for an empty file was found to be 8B630E030AD09E5D0E90FB246A3A75DBB6256C3EE7B8635A, which after the transformation specified here, becomes 5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6. Base-32 encoding this result yielded LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ, which was found to be what DC++ generates.
The only mention of the format of the hash in the Direct Connect protocol I found is on the $SR page on the NMDC Protocol wiki:
For files containing TTH, the <hub_name> parameter is replaced with TTH:<base32_encoded_tth_hash> (ref: TTH_Hash).
So, it is Base32-encoding. This is defined in RFC 4648 (and some earlier ones), section 6.
Basically, you are using the capital letters A-Z and the decimal digits 2 to 7, and one base32 digit represents 5 bits, while one base16 (hexadecimal) digit represents only 4.
This means each 5 hex digits map to 4 base32 digits, and for a Tiger hash (192 bits) you will need 40 base32 digits (in the official encoding, the last one would be an '=' padding character, which seems to be omitted if you say that there are always 39 characters).
I'm not sure of an implementation of a conversion from hex (or bytes) to base32, but it shouldn't be too complicated with a lookup table and some bit-shifting.
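For example, here is a rough sketch of such a conversion with a lookup table and some bit-shifting, fed with the byte-swapped digest from the question above:
function ConvertTo-Base32([byte[]] $bytes) {
    $alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
    $sb = New-Object System.Text.StringBuilder
    $buffer = 0
    $bits = 0
    foreach ($b in $bytes) {
        $buffer = ($buffer -shl 8) -bor $b
        $bits += 8
        while ($bits -ge 5) {
            $bits -= 5
            [void]$sb.Append($alphabet[($buffer -shr $bits) -band 0x1F])
        }
        $buffer = $buffer -band ((1 -shl $bits) - 1)   # keep only the bits not yet emitted
    }
    if ($bits -gt 0) {
        [void]$sb.Append($alphabet[($buffer -shl (5 - $bits)) -band 0x1F])   # final, zero-padded digit
    }
    $sb.ToString()   # '=' padding omitted, as DC clients do
}

# The byte-swapped digest from above should give the value DC++ shows:
$hex = "5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6"
$bytes = [byte[]] (0..($hex.Length / 2 - 1) | ForEach-Object { [Convert]::ToByte($hex.Substring($_ * 2, 2), 16) })
ConvertTo-Base32 $bytes   # LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ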