Hash calculation of a single file within a disk image

I have got an assignment to calculate the hash of a file from its disk image and then match it against the hash of the plain PDF version of the same file. I have calculated the starting and ending addresses of the file in the data section of the FAT32 volume by following the FAT table's linked-list structure. Is there any utility or software available to which I can feed the disk image file and the starting and ending addresses, and which outputs the hash value of the specified data?

A hex editor's "select block" option worked for me.
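If you would rather script it, a small Perl sketch like the one below does the same thing: seek to the file's first byte in the image and hash exactly the file's length. The image path, start offset, and length here are placeholder values; substitute the ones recovered from the FAT chain and the directory entry.

use strict;
use warnings;
use Digest::SHA;

my $image_path = 'disk.img';    # placeholder: path to the disk image
my $start      = 0x100400;      # placeholder: offset of the file's first byte
my $length     = 123_456;       # placeholder: exact file size from the directory entry

open my $fh, '<:raw', $image_path or die "Cannot open $image_path: $!";
seek $fh, $start, 0 or die "Seek failed: $!";

my $sha = Digest::SHA->new(256);
my $remaining = $length;
while ($remaining > 0) {
    my $want = $remaining < 65536 ? $remaining : 65536;   # read in 64 KB pieces
    my $got  = read $fh, my $buf, $want;
    die "Read failed or image truncated" unless $got;
    $sha->add($buf);
    $remaining -= $got;
}
close $fh;
print $sha->hexdigest, "\n";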

Related

Why are there junk "files" in my FAT32 SD root dir

I have been building a program to access files on a FAT32 micro SD card. However, when I run my program, or open my SD card with a hex viewer at the root directory, I can see my files, but there is also a bunch of junk mixed in. Why is that?
Directory data is organized in 32-byte records. This is nice, because any sector holds exactly 16 records, and no directory record will ever cross a sector boundary. There are four types of 32-byte directory records:
Normal record with short filename - Attrib is normal
Long filename text - Attrib has all four type bits set
Unused - First byte is 0xE5
End of directory - First byte is zero
Unused directory records are a result of deleting files. The first byte is overwritten with 0xE5, and later, when a new file is created, the record can be reused. At the end of the directory is a record that begins with zero. All other records are non-zero in their first byte, so this is an easy way to determine when you have reached the end of the directory.
Records that do not begin with 0xE5 or zero are actual directory data, and the format can be determined by checking the Attrib byte. For now, we are only concerned with the normal directory records that use the old 8.3 short-filename format. In FAT32, all files and subdirectories have short names, even if the user gave the file a longer name, so you can access all files without needing to decode the long-filename records (as long as your code simply ignores them).
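As a rough illustration of the layout described above, here is a hedged Perl sketch that classifies and decodes one 32-byte record. The field offsets follow the standard FAT32 short-entry layout; the helper name and the returned hash keys are made up for the example.

use strict;
use warnings;

# Sketch: decode one 32-byte FAT32 directory record passed in as raw bytes.
sub parse_dir_record {
    my ($record) = @_;

    my $first = ord substr($record, 0, 1);
    return { type => 'end' }    if $first == 0x00;   # end-of-directory marker
    return { type => 'unused' } if $first == 0xE5;   # deleted entry

    # name (11), attrib (1), skip 8, first-cluster high (2), skip 4,
    # first-cluster low (2), file size (4)
    my ($raw_name, $attrib, $cluster_hi, $cluster_lo, $size) =
        unpack 'a11 C x8 v x4 v V', $record;

    return { type => 'lfn' } if ($attrib & 0x0F) == 0x0F;  # long-filename text

    my $name = substr($raw_name, 0, 8) =~ s/\s+$//r;        # strip the space padding
    my $ext  = substr($raw_name, 8, 3) =~ s/\s+$//r;
    $name .= ".$ext" if length $ext;

    return {
        type          => 'file',
        name          => $name,
        attrib        => $attrib,
        first_cluster => ($cluster_hi << 16) | $cluster_lo,
        size          => $size,
    };
}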

How to hash a filename down to a small number or digit for output processing

I am not a Perl programmer, but I've inherited existing code that goes to a directory, finds all files in that folder and its subfolders (usually JPG or Office files), and then converts them into a single file used to load into a SQL Server database. The customer has about 500,000 of these files.
It takes about 45 minutes to create the file and then another 45 minutes for SQL to load the data. Crudely, it's doing about 150 files per second, which is reasonable, but total time is the issue for the job. There are many reasons I don't want to use other techniques, so please don't suggest other options unless they are closely aligned with this process.
What I was considering, to improve speed, is to run something like 10 concurrent processes. Each process would be passed a different argument (0-9). Each process would go to the directory and find all files as it currently does, but for each file found it would hash or kludge the filename down to a single digit (0-9), and if that matched the supplied argument, the process would process that file and write it out to its own unique file stream.
Then I would have 10 output files at the end. I doubt that the SQL Server side could be improved, as I would have to load into separate tables and then merge them in the database, and since these are BLOB objects, that will not be fast.
So I am looking for some basic code, or clues on what functions to use in Perl, to take a variable (the file name $File) and generate a single 0-9 value based on it. It could probably be done by getting the ASCII value of each character, adding these together to get a long number, then adding those digits together, and repeating until you get a single-digit answer.
Any clues or suggested techniques?
Here's an easy one to implement, suggested in the unpack function documentation:
sub string_to_code {
    # convert an arbitrary string to a digit from 0-9
    my ($string) = @_;
    return unpack("%32W*", $string) % 10;
}
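As a rough usage sketch, the loop below only handles files whose digit matches the bucket number passed on the command line; @files and process_file are placeholders for however the existing script enumerates and handles each file.

my $bucket = $ARGV[0];              # 0-9, one value per concurrent process
for my $file (@files) {
    next unless string_to_code($file) == $bucket;
    process_file($file);            # placeholder for the existing per-file work
}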

Combining file chunk hash values into a file fingerprint

We need a file fingerprint for all files uploaded to our server. Currently, sha256 is chosen as the hash function.
For large files, each file is split into several chunks of equal size (except the last one) for transfer. The sha256 value of each chunk is provided by the client; the values are re-calculated and checked by the server.
However, those sha256 values cannot be combined into the sha256 value for the whole file.
So I am considering changing the definition of the file fingerprint:
For files smaller than 1GB, the sha256 value is the fingerprint.
For files larger than 1GB, the file is sliced into 1GB chunks. Each chunk has its own sha256 value, denoted as s0, s1, s2, ... (all are integer values).
When the first chunk is received:
h0 = s0
When the second chunk is received:
h1 = SHA256(h0 << 256 + s1)
This is essentially concatenating the two hash values and hashing them again. The process is repeated until all chunks are received, and the final value hn is used as the file fingerprint.
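Just to make the proposal concrete, here is a small Perl sketch of that chained computation, assuming @chunk_hashes holds the raw binary sha256 digests s0, s1, ... in order (the names are illustrative only).

use Digest::SHA qw(sha256);

sub chained_fingerprint {
    my (@chunk_hashes) = @_;
    my $h = shift @chunk_hashes;   # h0 = s0
    for my $s (@chunk_hashes) {
        $h = sha256($h . $s);      # h(i) = SHA256(h(i-1) || s(i))
    }
    return unpack 'H*', $h;        # hex form of the final fingerprint hn
}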
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I have googled a lot and read a few articles on combine_hash functions in various languages and frameworks. Different authors choose different bit-mangling hash functions, and most of them are said to work well.
In my case, however, efficiency is not a concern, but the fingerprint is stored and used as the file content identifier system-wide.
My primary concern is whether the naive method listed above will introduce more collisions than sha256 itself.
If sha256 is not a good choice for combining hash values in our case, are there any recommendations?
You are essentially reinventing the Merkle tree.
What you'll have to do is split your large files into equally-sized chunks (except the last fragment), compute the hash of each chunk, and then combine the hashes pairwise until there is a single ultimate hash value. Note that the "root" hash will not be equal to the hash of the original file, but that is not required to validate the integrity of the entire file.
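For illustration, a minimal Perl sketch of the pairwise combination could look like the following; it assumes @leaves holds the raw binary sha256 digests of the chunks in order, and it duplicates the last digest on odd-sized levels (one common convention, not the only one).

use strict;
use warnings;
use Digest::SHA qw(sha256);

sub merkle_root {
    my (@level) = @_;
    return unpack 'H*', $level[0] if @level == 1;   # single value left: the root
    my @next;
    while (@level) {
        my $left  = shift @level;
        my $right = @level ? shift @level : $left;  # duplicate the odd leaf out
        push @next, sha256($left . $right);         # hash of the concatenated pair
    }
    return merkle_root(@next);
}

my @leaves = map { sha256($_) } @chunks;            # @chunks: the 1GB slices, in order
print merkle_root(@leaves), "\n";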

Can a FAT filesystem support multiple references to a file?

Can a FAT-based file system be modified to support multiple references to a file (i.e. aliases) by using the same FAT block sequence in multiple directory table entries?
No, because when any one reference was deleted, the file's clusters would be added to free space and possibly reused. This would result in two different files sharing space, with any write to one corrupting the other.
This could work if the file system were immutable, for example if it were written to an unwritable medium.
Sure, you can have directory items point to the same FAT records, but there are two things you should keep in mind:
1) never run any standard check-disk utilities, otherwise they will get it wrong;
2) you have to implement your own delete operation that removes every directory record pointing to the item you delete.
UPD: this answer assumes the "can it be modified" reading of the question.
The FAT file system stores all information about a file in a single structure inside a directory, except the addresses of the disk blocks that contain the file's data. The disk block numbers of all files are kept in the File Allocation Table (FAT).
Since the link information and the file container information are bound together in a single structure, the FAT file system does not support multiple links to a single file. It does not support symbolic links either, though it could have. However, Windows supports shortcuts, which are similar to symbolic links.

Does metadata alter the hash of a file?

We know that the hash value of a file is independent of its filename.
I did some experiments, and they showed that on Mac OS, changing a file's label (red, ...), keywords, or description (in OpenMeta) does not alter the hash value.
But changing metadata inside a JPEG does change the hash.
So I started to wonder why this holds. Any clue or inspiring tutorial?
The tool that you used apparently hashed what the OS considers as file contents, which in the case of a JPEG includes some metadata defined in the JPEG standard. Keywords, description, etc. are stored outside of the file contents proper by the filesystem.
(What is considered data and what metadata can be rather arbitrary and dependent on the context, e.g. the processing application and platform.)
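To see this concretely: a digest tool typically just hashes the byte stream the OS returns for the file, roughly like the hedged Perl sketch below ('photo.jpg' is a placeholder). Filesystem-level labels and extended attributes never enter that stream, whereas JPEG/EXIF metadata lives inside it.

use Digest::SHA;

my $sha = Digest::SHA->new(256);
$sha->addfile('photo.jpg', 'b');   # 'b' = binary mode; reads only the file's data bytes
print $sha->hexdigest, "\n";       # changes only if the bytes inside the file change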
There are different ways that metadata is stored.
For structured storage files created by COM applications, it's embedded directly in the file data. This would change the file's hash if the document properties were changed. On volumes formatted with NTFS 5 (Win2k and later), document properties can be added to any type of file and are stored in alternate data streams. I assume the same is true for the OS X filesystem.