How do you embed a hash into a file recursively?

Simplest case: You want to make a text file which says "The MD5 hash of this file is FOOBARHASH". How do you embed the hash, knowing that the embedded hash value and the hash of the file are inter-related?
For example, Cisco embeds hash values into their IOS images, which can be verified like this:
cisco# verify s72033-advipservicesk9_wan-mz.122-33.SXH7.bin
Embedded Hash MD5 : D2BB0668310392BAC803BE5A0BCD0C6A
Computed Hash MD5 : D2BB0668310392BAC803BE5A0BCD0C6A
Maybe I'm mistaken, but trying to figure out how to do this blows my mind.
Originally, I stated that Ubuntu ISOs have a text file containing the MD5 hash of the entire ISO file. This was not correct: on second look, the md5sum.txt file contains hashes for individual files.

You don't. The hash value is computed by putting a "dummy" or an empty string where the signature should be, hashing that document, and then inserting the signature value into the text. To verify the signature of the document, you strip the signature out, hash the document without the signature, and compare the result to the signature you stripped out.
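A minimal Python sketch of the placeholder approach described above. The template wording, the 32-zero PLACEHOLDER and the function names embed_hash and verify are assumptions made for illustration; real formats (Cisco's included) may position and encode the embedded value differently.

```python
import hashlib

# Assumed for illustration: a dummy value that occupies the hash field
# while the document is being hashed.
PLACEHOLDER = "0" * 32
TEMPLATE = "The MD5 hash of this file is {}\n"


def embed_hash(template: str = TEMPLATE) -> str:
    """Hash the document with the placeholder in place, then insert the digest."""
    dummy_doc = template.format(PLACEHOLDER)
    digest = hashlib.md5(dummy_doc.encode("utf-8")).hexdigest()
    return template.format(digest)


def verify(document: str, template: str = TEMPLATE) -> bool:
    """Strip the embedded digest, restore the placeholder, re-hash, compare."""
    prefix, suffix = template.split("{}")
    if not (document.startswith(prefix) and document.endswith(suffix)):
        return False
    embedded = document[len(prefix):len(document) - len(suffix)]
    restored = prefix + PLACEHOLDER + suffix
    expected = hashlib.md5(restored.encode("utf-8")).hexdigest()
    return embedded.lower() == expected


if __name__ == "__main__":
    doc = embed_hash()
    print(doc, end="")
    print(verify(doc))
```

Note that the embedded value is the hash of the document with the placeholder, not the hash of the final file, which is exactly why the circular dependency disappears.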
If you like that sort of challenge though, consider writing a program to produce self-describing pangrams:
This Pangram contains four as, one b, two cs, one d, thirty es, six fs, five gs, seven hs, eleven is, one j, one k, two ls, two ms, eighteen ns, fifteen os, two ps, one q, five rs, twenty-seven ss, eighteen ts, two us, seven vs, eight ws, two xs, three ys, & one z.
Have fun!

Related

SHA256 odd hex input

I tried hashing 'abc' as hexadecimal input on two different sites, but they give different hashes.
Later I found out that one site interprets it as '0abc' and the other as 'abc0'.
Since I'm finishing my SHA-256 hashing program, I was wondering which one is correct.
Thank you
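The two interpretations mentioned in the question can be reproduced with a short Python sketch. The function name sha256_of_hex and its pad parameter are hypothetical. The sketch only shows why the results differ; it does not settle which convention a given tool should use, although left-padding is the one that preserves the numeric value of the input.

```python
import hashlib


def sha256_of_hex(hex_text: str, pad: str = "left") -> str:
    """Hash hex input, padding an odd number of digits to whole bytes."""
    if len(hex_text) % 2:
        # 'abc' becomes '0abc' (left) or 'abc0' (right)
        hex_text = "0" + hex_text if pad == "left" else hex_text + "0"
    return hashlib.sha256(bytes.fromhex(hex_text)).hexdigest()


if __name__ == "__main__":
    print(sha256_of_hex("abc", pad="left"))   # hashes the bytes 0a bc
    print(sha256_of_hex("abc", pad="right"))  # hashes the bytes ab c0
```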

Is the file hashing/checksum value case insensitive?

My question is only about file hashing rather than hashing function in general. My assumption is that the value of a file checksum/hashing is case insensitive. My concern is that I cannot find any online documentation to confirm that. I only got the following two points to support my claim.
This link contains some file hash values. None of them contains any capital letter. https://www.virtualbox.org/download/hashes/6.1.2/SHA256SUMS
When I use the PowerShell Get-FileHash cmdlet, all results are in upper case. https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/get-filehash?view=powershell-7
Can anyone help me confirm my assumption, and provide some documentation on files in Windows as well as in Linux OS?
Hashes and checksums are often presented in hexadecimal notation. Although it is common to use upper case A-F instead of lower case a-f, it does not make any difference.
As for a reference, the question is so basic that it's hard to find a solid reference. One is ISO/IEC 9899 standard for the C programming language:
A hexadecimal constant consists of the prefix 0x or 0X followed by a
sequence of the decimal digits and the letters a (or A) through f (or
F) with values 10 through 15 respectively.
In some use cases, such as CSS, lower case might be preferred, as it is easier to read among other lower-case characters. .NET's Int32.ToString supports standard numeric format strings: x for lower case, X for upper case.
In System.Convert, there's ToInt32, which converts a string in a given base into a 32-bit integer. Let's see how the hex value AA is converted to decimal in different cases:
[convert]::toint32("aa", 16)
170
[convert]::toint32("AA", 16)
170
[convert]::toint32("aA", 16)
170
[convert]::toint32("Aa", 16)
170
Every combination of letter case represents the same decimal value, 170. Don't try this on hashes, though, as they are usually much longer than 32 bits.
My question is only about file hashing rather than hashing function in general. My assumption is that the value of a file checksum/hashing is case insensitive.
Hashes are byte sequences, they don't have case at all.
Hashes are generally encoded as hexadecimal for display, for which the 6 "letters" (a to f) can be either case. That's mostly a style issue, though I've known systems that did object when given the "wrong" case (some would only accept lowercase, others only uppercase).
Also beware that e.g. it's not unheard of to store or show hashes as base64 where case is relevant. Without knowing why you're asking (e.g. is it idle musing, or do you have an actual use case) it's hard to answer completely categorically.
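A small Python sketch of the point made in the answers above: hex digests can be compared safely by decoding them to bytes first, whereas base64 text is case-sensitive. The helper name hex_digests_equal is made up for this example.

```python
import base64
import hashlib


def hex_digests_equal(a: str, b: str) -> bool:
    """Compare two hex-encoded digests by the bytes they represent."""
    return bytes.fromhex(a.strip()) == bytes.fromhex(b.strip())


if __name__ == "__main__":
    digest = hashlib.sha256(b"example file contents").hexdigest()  # lower case
    print(hex_digests_equal(digest, digest.upper()))  # True

    # Base64 is different: changing the case changes the decoded bytes.
    print(base64.b64decode("ABCD") == base64.b64decode("abcd"))  # False
```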

Looking for a good 64 bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'.
The paths are null-terminated root relative windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys"
I want to hash this path into a 64-bit unsigned integer.
It does not need to be "cryptographically sound".
The hashes should be case insensitive, but able to handle non-ascii letters.
Obviously, the hash also should scatter well.
There are some ideas that I had thought of:
A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, so this is not an option.
B) Just use a regular string hash: hash += prime * hash + codepoint for the whole string.
I do have the feeling that the fact that the path consists of "segments" (folder names and the final file name) could be leveraged.
To sum up the needs:
1) 64bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive
I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:
ui64 res = 10000019;
for (i = 0; i < len; i += 2)
{
    // pack two upper-cased UTF-16 code units into one 32-bit value
    ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
    res = res * 8191 + merge; // unchecked arithmetic (wraps modulo 2^64)
}
return res;
I'm assuming that path[i + 1] is safe on the basis that if len is odd, the last iteration will safely read the terminating U+0000.
I wouldn't try to exploit the gaps in the input space (the gaps in UTF-16, the lower-case and title-case characters that upper-casing removes, and the characters that are invalid in paths), because they are not distributed in a way that can be exploited quickly. Subtracting 32 from each code unit (all characters below U+0020, i.e. decimal 32, are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
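The pseudocode above could be sketched in Python roughly as follows. Two things are assumptions here: str.upper() stands in for ucase, although it does not necessarily match the case mapping Windows applies to file names, and an odd number of code units is padded with 0 to play the role of the null terminator. The name path_hash64 is hypothetical.

```python
MASK64 = (1 << 64) - 1  # emulate unchecked 64-bit unsigned arithmetic


def path_hash64(path: str) -> int:
    """Case-insensitive 64-bit hash over the UTF-16 code units of a path."""
    # Assumption: str.upper() approximates ucase from the pseudocode.
    data = path.upper().encode("utf-16-le")
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    if len(units) % 2:
        units.append(0)  # stands in for the terminating U+0000

    res = 10000019
    for i in range(0, len(units), 2):
        merge = units[i] * 65536 + units[i + 1]
        res = (res * 8191 + merge) & MASK64
    return res


if __name__ == "__main__":
    print(hex(path_hash64(r"\windows\system32\drivers\myDriver32.sys")))
```

Python integers do not overflow, so the mask after every step is what reproduces the wrap-around that the "unchecked arithmetic" comment relies on.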
Cryptographically secure hashes might not be very efficient in terms of speed, but there are implementations available for virtually any programming language.
Whether using them is feasible for your application depends on how much you depend on speed – a benchmark would give you an appropriate answer to that.
You could use a sub-string of such a hash, e.g. the MD5 of your path, converted to lower case beforehand so that the hash is effectively case-insensitive (this requires a lower-casing method that knows how to convert all non-ASCII characters that may occur in the file system).
Cryptographically secure hashes have the benefit of a quite even distribution no matter which sub-string you take, because they are designed to be unpredictable, i.e. each part of the hash ideally depends on the entire hashed data just as much as any other part does.
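As a rough illustration of this suggestion, the following Python sketch lower-cases the path, hashes its UTF-16 bytes with MD5 and keeps the first 64 bits. The function name is hypothetical, and str.lower() is assumed to be an adequate case mapping, which may not match the file system's own rules exactly.

```python
import hashlib


def path_hash64_md5(path: str) -> int:
    """First 64 bits of the MD5 of the lower-cased, UTF-16 encoded path."""
    normalized = path.lower().encode("utf-16-le")
    digest = hashlib.md5(normalized).digest()  # 16 bytes
    return int.from_bytes(digest[:8], "big")   # keep the first 8 bytes


if __name__ == "__main__":
    print(hex(path_hash64_md5(r"\windows\system32\drivers\myDriver32.sys")))
```

Whether this is fast enough is something only a benchmark on the target system can answer.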
Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 million messages per second. You may find faster non-cryptographic hashes, but it already takes a rather specific situation for it to make a measurable difference.
For the specific properties you are after, you would need:
To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.
There are MD4 implementations available in many places, including right in the RFC 1320 I link to above. You may also find open-source MD4 implementations in C and Java in sphlib.
You could just create a shared library in C# and use the FileInfo class to obtain the full path of a directory or file. Then call .GetHashCode() on the path, like this:
Hash = fullPath.GetHashCode();
or
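// Requires: using System; using System.IO;
int getHashCode(string uri)
{
    if (uri == null) throw new ArgumentNullException(nameof(uri));
    // FileInfo resolves a relative path to a full path
    FileInfo fileInfo = new FileInfo(uri);
    // Note: string.GetHashCode() is 32-bit and case-sensitive
    return fileInfo.FullName.GetHashCode();
}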
Although this is only a 32-bit code, you could duplicate it or append another hash code based on some other characteristic of the file.

hash function to index similar text

I'm looking for a kind of hash function to index similar texts. For example, if we have two very long texts called "A" and "B" that do not differ much, then the hash function (called H) applied to A and B should return the same number.
So H(A) = H(B) where A and B are similar text.
I tried "DoubleMetaphone" (my text is in Italian), but I saw that it depends very strongly on the string prefixes. For example:
A = "This is the very long text that I want to hash"
B = "This is the very"
==> doubleMetaPhone(A) = doubleMetaPhone(B)
This is not good for me, because strings with the same prefix would be treated as similar, and I don't want that.
Could anyone suggest another way?
see http://en.wikipedia.org/wiki/Locality_sensitive_hashing
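To give a feel for what locality-sensitive hashing looks like in practice, here is a minimal MinHash sketch in Python. MinHash is only one member of the LSH family and is chosen here as an example. The shingle size, the number of hash functions, the use of SHA-1 as the underlying hash and all function names are assumptions, not something the linked article prescribes.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word windows of the text (case and spacing normalized)."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash64(seed: int, item: str) -> int:
    """One member of a seeded family of 64-bit hash functions."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> list:
    """For each seeded hash function, keep the minimum over all shingles."""
    items = shingles(text, k)
    return [min(_hash64(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = "This is the very long text that I want to hash"
    b = "This is the very"
    print(estimated_similarity(minhash_signature(a), minhash_signature(b)))
```

Unlike a prefix-driven encoding, the signature depends on the whole set of shingles, so a shared prefix alone does not make two texts look identical. Similar texts get similar signatures rather than one identical number; grouping parts of the signature into buckets is the usual next step for indexing.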
Your problem is (close to) insoluble for many distance functions between strings.
Most distance functions (e.g. edit distance) allow you to transform a string into another string via a sequence of 1-distance transformations:
"AAAA" -> "AAAB" -> "AAABC"
According to your requirements, the first and second strings should have the same hash value. But so must the second and the third, and so on. So if every pair at distance 1 must share a hash value, all strings end up with the same hash.
Even if we impose a higher threshold on the distance (maybe in relation to string length), we'll end up with a messy result.
A better (IMO) approach is to find an equivalence relation on the set of strings, such that each string in each equivalence class has the same hash. A possibility is to define classes by their distance to a predefined string (e.g. edit distance from "AAAAA"), and the distance itself would be the hash value. Probably this approach would not be the best in your case, but maybe with some extra info on the problem we can come up with a better equivalence relation.
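The equivalence-class idea can be sketched in Python with a plain Levenshtein distance, using "AAAAA" as the predefined string from the example. The names edit_distance and class_hash are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


REFERENCE = "AAAAA"  # the predefined string from the example above


def class_hash(text: str) -> int:
    """Hash = distance to the reference; equal values form one class."""
    return edit_distance(text, REFERENCE)


if __name__ == "__main__":
    for s in ("AAAA", "AAAB", "AAABC"):
        print(s, class_hash(s))  # 1, 2, 2
```

The output shows the weakness already hinted at: "AAAA" and "AAAB" are one edit apart but land in different classes, while unrelated strings at the same distance from the reference share a class.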

How can I generate this hash?

I'm new to programming (just started!) and have hit a wall recently. I am making a fansite for World of Warcraft, and I want to link to a popular site (wowhead.com). The following page shows what I'm trying to figure out: http://www.wowhead.com/?talent#ozxZ0xfcRMhuVurhstVhc0c
From what I understand, the "ozxZ0xfcRMhuVurhstVhc0c" part of the link is a hash. It contains all the information about that particular talent spec on the page, and changes whenever I add or remove points in a talent. I want to be able to recreate this part, so that I can then link my users directly to wowhead to view their talent trees, but I haven't the foggiest idea how to do that. Can anyone provide some guidance?
The first character indicates the class:
0 Druid
c Hunter
o Mage
s Paladin
b Priest
f Rogue
h Shaman
I Warlock
L Warrior
j Death Knight
The remaining characters indicate where in each tree points have been allocated. Each tree is separate, delimited by 'Z'. So if e.g. all the points are in the third tree, then the 2nd and 3rd characters will be "ZZ" indicating "end of first tree" and "end of second tree".
To generate the code for a given tree, split the talents up into pairs, going left-to-right and top-to-bottom. Each pair of talents is represented by a single character. So for example, in the DK's Blood tree segment, the first character will indicate the number of points allocated to Butchery and Subversion, and the second character will stand for Blade Barrier and Bladed Armor.
What character represents each allocation among the pair? I'm sure there's an algorithm, probably based on the ASCII character set, but all I've worked out so far is this lookup table. Find the number of points in the first talent along the top, and the number of points in the second talent along the left side. The encoded character is at the intersection.
    0 1 2 3 4 5
0   0 o b h L x
1   z k d u p t
2   M R r G T g
3   c s f I j e
4   m a w N n v
5   V q i A y E
So if our Death Knight has one point in Butchery and two points in Subversion, the first character is 'R'. If instead we put no points in those two and five in Blade Barrier, the first two characters will be "0x". Trailing '0's (all the other pairs in the tree with no points allocated) can be omitted, as can trailing 'Z' delimiters (when there are no points in the subsequent trees). For one final example, the entire code for a DK with just a single point in Toughness would be "jZ0o": "Death Knight", "End of the first tree", "No points in the first pair of talents", "one point in the first talent of the second pair".
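The rules described so far can be sketched in Python as follows. This is based only on the lookup table and examples in this answer, not on wowhead's actual code; the function names are hypothetical, and treating an odd number of talents as padded with a zero is an assumption.

```python
# Rows: points in the second talent (0..5); columns: points in the first.
LOOKUP = ["0obhLx", "zkdupt", "MRrGTg", "csfIje", "mawNnv", "VqiAyE"]


def encode_pair(first: int, second: int) -> str:
    """Character for one pair of talents, read from the lookup table."""
    return LOOKUP[second][first]


def encode_tree(points: list) -> str:
    """Encode one tree; points are listed left-to-right, top-to-bottom."""
    if len(points) % 2:
        points = points + [0]  # assumption: pad an odd count with zero
    code = "".join(encode_pair(points[i], points[i + 1])
                   for i in range(0, len(points), 2))
    return code.rstrip("0")  # trailing all-zero pairs can be omitted


def encode_build(class_char: str, trees: list) -> str:
    """Class character, then the tree codes separated by 'Z'."""
    body = "Z".join(encode_tree(tree) for tree in trees)
    return class_char + body.rstrip("Z")  # trailing delimiters can be omitted


if __name__ == "__main__":
    # Death Knight with a single point in the third talent of the second tree
    print(encode_build("j", [[], [0, 0, 1, 0], []]))  # jZ0o
```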
Can anyone work out what function generates the lookup table above? There's probably a clue in the codes for the classes: in alphabetical order (except for the DK which was added to the game after the others), they correspond to a series in the lookup table of (0,0), (0,3), (1,0), (1,3), (2,0), etc.
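One observation that follows directly from the table and the class codes above: reading the table column by column gives a 36-character alphabet in which a pair sits at index 6 * first + second, and the class codes sit at every third index. The Python sketch below checks this. It does not explain why the alphabet is in this particular order, and the variable names are made up for the example.

```python
# Rows: points in the second talent (0..5); columns: points in the first.
LOOKUP = ["0obhLx", "zkdupt", "MRrGTg", "csfIje", "mawNnv", "VqiAyE"]

# Flatten the table column by column.
ALPHABET = "".join(LOOKUP[second][first]
                   for first in range(6) for second in range(6))

CLASS_CODES = [("Druid", "0"), ("Hunter", "c"), ("Mage", "o"),
               ("Paladin", "s"), ("Priest", "b"), ("Rogue", "f"),
               ("Shaman", "h"), ("Warlock", "I"), ("Warrior", "L"),
               ("Death Knight", "j")]


def encode_pair(first: int, second: int) -> str:
    """Same result as the table lookup, computed from a single index."""
    return ALPHABET[6 * first + second]


if __name__ == "__main__":
    print(ALPHABET)
    for name, code in CLASS_CODES:
        print(name, ALPHABET.index(code))  # 0, 3, 6, ... 27
```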
If you go to http://www.wowhead.com/?talent and start using the talent tree, you can see the mysterious code being built up in the address bar as you click on the various boxes. So it's definitely not a hash but some kind of encoded structured data.
Since the code is built up as you click, the logic for building it will be in the JavaScript on that page.
So my advice is do a view source on the page, download the JavaScript files and have a look at them.
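To illustrate that such a code is reversible, unlike a hash, here is a Python sketch that decodes it back into points per talent. It relies on the lookup table worked out in another answer in this thread rather than on wowhead's JavaScript, so the table, the 'Z' delimiter handling and the function name are all assumptions.

```python
# Rows: points in the second talent (0..5); columns: points in the first.
LOOKUP = ["0obhLx", "zkdupt", "MRrGTg", "csfIje", "mawNnv", "VqiAyE"]

# Reverse mapping: character -> (points in first talent, points in second)
DECODE = {LOOKUP[second][first]: (first, second)
          for second in range(6) for first in range(6)}


def decode_build(code: str):
    """Split off the class character, then decode each 'Z'-separated tree."""
    class_char, body = code[0], code[1:]
    trees = []
    for tree_code in body.split("Z"):
        points = []
        for ch in tree_code:
            points.extend(DECODE[ch])
        trees.append(points)
    return class_char, trees


if __name__ == "__main__":
    print(decode_build("jZ0o"))  # ('j', [[], [0, 0, 1, 0]])
    print(decode_build("ozxZ0xfcRMhuVurhstVhc0c"))
```

Omitted trailing pairs and trees simply do not appear in the result, so the caller has to treat missing entries as zero points.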
I think it isn't a hash value, because hash values are normally one-way values. This means you cannot (easily) restore the original information from which the hash code was generated.
Best thing would be to contact someone from wowhead.com and ask them how to interpret this information. I am sure they will help you out with some information about what type of encoding they use for the parameters. But without any help of the developers from wowhead.com it is almost impossible to figure out what information is encoded into this parameter.
I am not even sure the parameter you mentioned contains the talents of your character. Maybe it's just a session id or something like that. Take a look into the post data your browser sends to the server, it may contain a hidden field with the value you are searching for (you can use Tamper Data Firefox Addon).
I don't think ozxZ0xfcRMhuVurhstVhc0c is a hash value. I think it is a key (probably encrypted/encoded in some way). The server uses this key to retrieve information from its database. Since you don't have access to the database, you don't know which key is needed, let alone how to encode it.
You need the original function that generates the hash.
I don't think that's public though :(
Check this out: hash wikipedia
Good luck learning how to program!
These hashes are hard to 'reverse engineer' unless you know how they were generated.
For example, it could be:
s1 = "random_string-" + score;
hash = encrypt(s1)
...etc
So it is hard to get the original data back from the hash (that is the whole point anyway).
Your best bet would be to link to the profile that has the latest score, etc.