Doing a hash by hand/mathematically

I want to learn how to do a hash by hand (like with paper and pencil). Is this feasible? Any pointers on where to learn about this would be appreciated.

That depends on the hash you want to do. You can do a really simple hash by hand pretty easily -- for example, one trivial one is to take the ASCII values of the string, and add them together, typically doing something like a left-rotate between characters. So, to hash the string "Hash", we'd start with the ASCII values of the letters (in hex): 48 61 73 68. We'll add those together, rotating our result left 4 bits (in a 16-bit word) between letters:
0048 + 0061 = 00A9
00A9 <<< 4 = 0A90
0A90 + 0073 = 0B03
0B03 <<< 4 = B030
B030 + 0068 = B098
Result: B098
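For concreteness, here's a minimal sketch of that toy hash in C (my own illustration, not a standard algorithm; note that, matching the arithmetic above, the rotation starts only after the first two letters have been added):

#include <stdint.h>
#include <stdio.h>

/* Rotate a 16-bit word left by 4 bits. */
static uint16_t rotl4(uint16_t x)
{
    return (uint16_t)((x << 4) | (x >> 12));
}

/* Toy hash: sum the characters, rotating left 4 bits between additions
   (as in the worked example, the first rotation comes after the first
   two letters). */
static uint16_t toy_hash(const char *s)
{
    uint16_t h = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (i >= 2)
            h = rotl4(h);
        h = (uint16_t)(h + (unsigned char)s[i]);
    }
    return h;
}

int main(void)
{
    printf("%04X\n", toy_hash("Hash")); /* prints B098 */
    return 0;
}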
Doing a cryptographic hash by hand would be a rather different story. It's certainly still possible, but would be extremely tedious, to put it mildly. A cryptographic hash is typically quite a bit more complex, and (more importantly) almost always has a lot of "rounds", meaning that you basically repeat a set of steps a number of times to get from the input to the output. Speaking from experience, just stepping through SHA-1 in a debugger to be sure you've implemented it correctly is a pain -- doing it all by hand would be pretty awful (but as I said, certainly possible anyway).

You can start by looking at the Wikipedia article Hash function.

I would suggest trying a CRC, since it seems to me to be the easiest to do by hand: https://en.wikipedia.org/wiki/CRC32#Computation .
You can do a smaller length than standard (it's usually 32 bit) to make things easier.
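For instance, here is a minimal bit-by-bit CRC-8 sketch (polynomial 0x07, init 0x00; that polynomial choice is just an assumption for illustration). Each inner-loop iteration is one shift-and-conditional-XOR, which is exactly the step you would perform on paper:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bitwise CRC-8 over the polynomial x^8 + x^2 + x + 1 (0x07). */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                   /* fold in the next byte */
        for (int bit = 0; bit < 8; bit++) /* one shift per bit     */
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    const char *msg = "Hash";
    printf("CRC-8 of \"%s\": %02X\n", msg,
           crc8((const uint8_t *)msg, strlen(msg)));
    return 0;
}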

Related

Is the file hashing/checksum value case insensitive?

My question is only about file hashing rather than hash functions in general. My assumption is that the value of a file checksum/hash is case-insensitive. My concern is that I cannot find any online documentation to confirm that. I only have the following two points to support my claim.
This link contains some file hash values. None of them contains any capital letter. https://www.virtualbox.org/download/hashes/6.1.2/SHA256SUMS
When I use the PowerShell Get-FileHash cmdlet, all returned values are in capitals. https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/get-filehash?view=powershell-7
Can anyone help me confirm my assumption, and provide some documentation on files in Windows as well as in Linux OS?
Hashes and checksums are often presented in hexadecimal notation. Although it is common to use upper case A-F instead of lower case a-f, it does not make any difference.
As for a reference, the question is so basic that it's hard to find a solid one. One source is the ISO/IEC 9899 standard for the C programming language:
A hexadecimal constant consists of the prefix 0x or 0X followed by a sequence of the decimal digits and the letters a (or A) through f (or F) with values 10 through 15 respectively.
In some use cases, such as CSS, lower case might be preferred, as it is more pleasant to read among other lower-case characters. .NET's Int32.ToString supports standard numeric format strings: x for lower case, X for upper case.
In System.Convert there's ToInt32, which converts values from a given base into 32-bit integers. Let's see how the hex value AA is converted to decimal with different letter cases:
[convert]::toint32("aa", 16)
170
[convert]::toint32("AA", 16)
170
[convert]::toint32("aA", 16)
170
[convert]::toint32("Aa", 16)
170
Every letter-case combination represents the same decimal value, 170. Don't try this on full hashes though, as those are usually larger than a 32-bit integer.
My question is only about file hashing rather than hashing function in general. My assumption is that the value of a file checksum/hashing is case insensitive.
Hashes are byte sequences; they don't have case at all.
Hashes are generally encoded as hexadecimal for display, for which the six "letters" (a to f) can be either case. That's mostly a style issue, though I've known systems which did object when given the "wrong" case (some would only accept lower case, others only upper case).
Also beware that it's not unheard of to store or show hashes as Base64, where case is significant. Without knowing why you're asking (e.g. is it idle musing, or do you have an actual use case) it's hard to answer completely categorically.
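If you do need to compare digests coming from systems that disagree on case, the safe move is to normalize before comparing. A minimal sketch in C:

#include <ctype.h>
#include <stdio.h>

/* Compare two hex digest strings, ignoring letter case.
   Returns 1 if they are the same digest, 0 otherwise. */
static int hex_digests_equal(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

int main(void)
{
    /* hypothetical digests differing only in case */
    printf("%d\n", hex_digests_equal("0Af3", "0aF3")); /* prints 1 */
    return 0;
}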

Calculating partial hash collisions in openCl

I would like to find 2 SHA-256 hashes of 2 strings (both of which start with "helloworld" and then have a number of random ASCII characters following) where the first n characters of the hashes match, with n being as large as possible.
for example:
String 1 = helloworld\V.T ao>
String 2 = helloworld EF{B -QMl
Hash 1 = JRFqsbBDZBUx9Ot0LviMEr6rAdKmUai/kx8HD0EskxE=
Hash 2 = JRFnMO6jm0hzdZ+jYZybNl9yVnPl9g5Y0vlz0Rf/6UE=
The first three characters of the hashes match.
Currently I am using Java and MessageDigest for this, which is slow, so I was thinking that if I could use my GPU and OpenCL the program could run much faster; however, I know nothing about OpenCL or how I would go about coding something like this.
Does anyone know of an existing tool which could do this, or maybe some code?
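For a CPU-only starting point (before bringing OpenCL into it), a naive sketch in C using OpenSSL's one-shot SHA256() might look like the following; it fixes one string and searches numbered suffixes for the longest shared digest prefix (the suffix scheme and loop bound are just assumptions for illustration):

#include <openssl/sha.h>
#include <stdio.h>
#include <string.h>

/* Number of leading bytes shared by two digests. */
static int common_prefix_bytes(const unsigned char *a,
                               const unsigned char *b, int len)
{
    int n = 0;
    while (n < len && a[n] == b[n])
        n++;
    return n;
}

int main(void)
{
    unsigned char target[SHA256_DIGEST_LENGTH], d[SHA256_DIGEST_LENGTH];
    char buf[64];
    int best = -1;

    const char *fixed = "helloworld0";
    SHA256((const unsigned char *)fixed, strlen(fixed), target);

    for (unsigned i = 1; i < 100000000u; i++) {
        snprintf(buf, sizeof buf, "helloworld%u", i);
        SHA256((const unsigned char *)buf, strlen(buf), d);
        int n = common_prefix_bytes(target, d, SHA256_DIGEST_LENGTH);
        if (n > best) {
            best = n;
            printf("%s vs %s: %d shared prefix byte(s)\n", fixed, buf, n);
        }
    }
    return 0;
}

(Compile with -lcrypto. A birthday-style search that stores many hashes in a table would find long shared prefixes far faster than this fixed-target loop, and that table-based structure is also what you would want to port to OpenCL.)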

Encoding that minimizes misreading / mistyping / misspeaking?

Let's say you have a system in which a fairly long key value can be accurately communicated to a user on-screen, via email or via paper; but the user needs to be able to communicate the key back to you accurately by reading it over the phone, or by reading it and typing it back into some other interface.
What is a "good" way to encode the key to make reading / hearing / typing it easy & accurate?
This could be an invoice number, a document ID, a transaction ID or some other abstract value. Let's say for the sake of this discussion the underlying key value is a big number, say 40 digits in base 10.
Some thoughts:
Shorter keys are generally better
a 40-digit base 10 value may not fit in the space given, and is easy to get lost in the middle of
the same value could be represented in base 16 in 33-34 digits
the same value could be represented in base 36 in 26 digits
the same value could be represented in base 64 in 22-23 digits
Characters that can't be visually confused with each other are better
e.g. an encoding that includes both O (oh) and 0 (zero), or S (ess) and 5 (five), could be bad
This issue depends on the font / face used to display the key, which you may be able to control in some cases (like printing on paper) but can't control in others (like web pages and email).
Also depends on whether you can control the exclusive use of upper and/or lower case -- e.g. capital D (dee) may look like O (oh) but lower case d (dee) would not, while lower case l (ell) looks like 1 (one) but capital L (ell) would not. (With exceptions for especially exotic fonts / faces.)
Characters that can't be verbally / aurally confused with each other are better
a (ay) 8 (eight)
B (bee) C (cee) D (dee) E (ee) g (gee) p (pee) t (tee) v (vee) z (zee) 3 (three)
This issue depends on the audio quality of the end-to-end channel -- bigger challenge if the expected user base could have a speech impediment, or may have to speak through a gas mask, or the communication channel could include CB radios or choppy VOIP phone systems.
Adding a check digit or two would detect errors but not help resolve errors.
An alpha - bravo - charlie - delta type dialog can help with hearing errors, but not reading errors.
Possible choices of encoding:
Base 64 -- compact, but too many hard-to-verbalize characters (underscore, dash etc.)
Base 34 -- 0-9 and A-Z but with O (oh) and I (eye) left out as the easiest to confuse with digits
Base 32 -- same as base 34 but leave out the 0 (zero) and 1 (one) as well
Is there a generally recognized encoding that is a reasonable solution for this scenario?
When I first heard of it, I liked the article A Proposal for Proquints: Identifiers that are Readable, Spellable, and Pronounceable. It encodes data as a sequence of alternating consonants and vowels. It's tied to the English language, though. (In German, for example, f and v sound the same, so they should not both be used.) But I like the general idea.
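Based on the proposal, each 16-bit word becomes five letters in a consonant-vowel-consonant-vowel-consonant pattern (a 4-2-4-2-4 bit split over 16 consonants and 4 vowels). A minimal sketch in C:

#include <stdint.h>
#include <stdio.h>

/* Encode one 16-bit word as a five-letter proquint. */
static void proquint16(uint16_t w, char out[6])
{
    static const char con[] = "bdfghjklmnprstvz"; /* 16 consonants */
    static const char vow[] = "aiou";             /* 4 vowels      */
    out[0] = con[(w >> 12) & 0xF];
    out[1] = vow[(w >> 10) & 0x3];
    out[2] = con[(w >>  6) & 0xF];
    out[3] = vow[(w >>  4) & 0x3];
    out[4] = con[ w        & 0xF];
    out[5] = '\0';
}

int main(void)
{
    char hi[6], lo[6];
    proquint16(0x7F00, hi); /* high half of 127.0.0.1 */
    proquint16(0x0001, lo); /* low half               */
    printf("%s-%s\n", hi, lo); /* prints lusab-babad, as in the proposal */
    return 0;
}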

Looking for a good 64-bit hash for file paths in UTF-16

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'.
The paths are null-terminated, root-relative Windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys".
I want to hash this path into a 64-bit unsigned integer.
It does not need to be "cryptographically sound".
The hashes should be case-insensitive, but able to handle non-ASCII letters.
Obviously, the hash also should scatter well.
There are some ideas that I have thought of:
A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, so this is not an option.
B) Just use a regular string hash: hash += prime * hash + codepoint for the whole string.
I do have the feeling that the fact that the path consists of "segments" (folder names and the final file name) can be leveraged.
To sum up the needs:
1) 64-bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive
I would just use something straightforward. I don't know what language you are using, so here is a sketch in C:
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

uint64_t path_hash(const wchar_t *path, size_t len)
{
    uint64_t res = 10000019; /* arbitrary prime seed */
    for (size_t i = 0; i < len; i += 2) {
        /* merge two UTF-16 code units, upper-cased for case insensitivity */
        uint64_t merge = (uint64_t)towupper(path[i]) * 65536
                       + (uint64_t)towupper(path[i + 1]);
        res = res * 8191 + merge; /* unchecked (wrapping) arithmetic */
    }
    return res;
}
I'm assuming that reading path[i + 1] is safe on the basis that if len is odd, the last iteration will safely read the terminating U+0000.
I wouldn't try to exploit the gaps in the value space (the UTF-16 surrogate range, the lower-case and title-case characters that never occur after upper-casing, and characters invalid in paths), because these gaps are not distributed in a way that can be exploited quickly. Shifting everything down by 32 (all characters below U+0020, i.e. decimal 32, are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
Cryptographically secure hashes might not be very efficient in terms of speed, but there are implementations available for virtually any programming language.
Whether using them is feasible for your application depends on how much you depend on speed – a benchmark would give you an appropriate answer to that.
You could use a sub-string of such a hash, e.g. MD5 of your path, previously converted to lower case so that the hash is effectively case-insensitive (this requires a lower-casing method that knows how to convert all the non-ASCII UTF-16 characters that may occur in the file system).
Cryptographically secure hashes have the benefit of a quite even distribution no matter which sub-string you take, because they are designed to be unpredictable, i.e. each part of the hash ideally depends on the entire hashed data, just like every other part of it.
Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 million messages per second. You may find faster non-cryptographic hashes, but it already takes a rather specific situation for them to make a measurable difference.
For the specific properties you are after, you would need:
To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.
There are MD4 implementations available in many places, including in RFC 1320 itself. You may also find open-source MD4 implementations in C and Java in sphlib.
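A minimal sketch of the truncation step in C, assuming OpenSSL's legacy one-shot MD4() (deprecated in OpenSSL 3.0; the sphlib or RFC 1320 code would slot in the same way). For brevity it hashes a narrow, already lower-cased string; in your case you would feed it the normalized UTF-16 bytes:

#include <openssl/md4.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hash data with MD4 and keep the first 64 of the 128 digest bits. */
static uint64_t md4_64(const void *data, size_t len)
{
    unsigned char digest[MD4_DIGEST_LENGTH];
    MD4((const unsigned char *)data, len, digest);

    uint64_t h;
    memcpy(&h, digest, sizeof h);
    return h;
}

int main(void)
{
    const char *path = "\\windows\\system32\\drivers\\mydriver32.sys";
    printf("%016llx\n", (unsigned long long)md4_64(path, strlen(path)));
    return 0;
}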
You could just create a shared library in C# and use the FileInfo class to obtain the full path of a directory or file. Then call .GetHashCode() on the path, like this:
Hash = fullPath.GetHashCode();
or
int getHashCode(string uri)
{
    if (uri == null) throw new ArgumentNullException(nameof(uri));
    FileInfo fileInfo = new FileInfo(uri);
    return fileInfo.FullName.GetHashCode();
}
Although this is just a 32-bit code, you can duplicate it or append another hash code based on some other characteristic of the file, as sketched below.
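A generic sketch of that appending step (shown in C for illustration; the shifts are identical in C#), packing two 32-bit codes into one 64-bit value:

#include <stdint.h>

/* Combine two 32-bit hash codes into a single 64-bit value. */
uint64_t combine(uint32_t a, uint32_t b)
{
    return ((uint64_t)a << 32) | b;
}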

How to remove the warning "large integer implicitly truncated" for sqlite/unicode support?

I use the solution from http://ioannis.mpsounds.net/2007/12/19/sqlite-native-unicode-like-support/ for my POS app for the iPhone, and it works great.
However, as mentioned in the comments there:
For instance, sqlite_unicode.c line 1861 contains integral constants greater than 0xffff that are declared as unsigned short. I wonder how I should cope with that.
I'm fixing all the warnings in my project and this is the last one. The code is this:
static unsigned short unicode_unacc_data198[] = { 0x8B8A, 0x8D08, 0x8F38, 0x9072, 0x9199, 0x9276, 0x967C, 0x96E3, 0x9756, 0x97DB, 0x97FF, 0x980B, 0x983B, 0x9B12, 0x9F9C, 0x2284A, 0x22844, 0x233D5, 0x3B9D, 0x4018, 0x4039, 0x25249, 0x25CD0, 0x27ED3, 0x9F43, 0x9F8E, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF };
I don't know about this hex stuff, so what should I do? I don't get an error, but I don't know whether this could cause one in the future...
Yes, 0x2284A is indeed larger than 0xFFFF, which is the largest value a 16-bit unsigned integer can contain. (*)
This is a lookup table for mapping characters with diacritical marks to basic unaccented characters. For some reason, a few mappings are defined that point to characters outside the ‘Basic Multilingual Plane’ of Unicode characters that fit in 16 bits.
U+2284A and the others above are highly obscure extended Chinese characters. I'm not sure why a character in the BMP would refer to such a character as its base unaccented version. Maybe it's an error in the source data used to generate the tables, or maybe it's just another weird quirk of the Chinese writing system. Either way, it's extremely unlikely you'll ever need that mapping. So just change all the five-digit hex codes in this array to be 0xFFFF instead (which seems to be what the code is using to signify ‘no mapping’).
(*: in theory a short could be more than 16 bits, but in reality it isn't going to be. If it were it looks like this code would totally fall over anyway, as it's freely mixing short with u16 pointers.)
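For reference, the patched array would then read (the six five-digit codes replaced by the 0xFFFF "no mapping" sentinel):

static unsigned short unicode_unacc_data198[] = {
    0x8B8A, 0x8D08, 0x8F38, 0x9072, 0x9199, 0x9276, 0x967C, 0x96E3,
    0x9756, 0x97DB, 0x97FF, 0x980B, 0x983B, 0x9B12, 0x9F9C, 0xFFFF,
    0xFFFF, 0xFFFF, 0x3B9D, 0x4018, 0x4039, 0xFFFF, 0xFFFF, 0xFFFF,
    0x9F43, 0x9F8E, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF };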