LZW compression & dictionary implementation using hashing - hash

I have a long text which has to be compressed using LZW compression algorithm. I have to assign 16 bit code for sequence of ASCII characters. for eg 'aa' will have 16-bit code '0000000010000000' (just available after 'DEL' ie 0000000001111111). Now before starting compression I have to initialize the dictionary which is
'NUL':0000000000000000
'SOH': 0000000000000001,
.
.
.
.
'DEL':0000000001111111.
I have to use hashing to implement this dictionary. Now I need help in understanding this statement that how hashing is used to implement the dictionary. Also please suggest me hash function that will do the job. Side note- I have to use quadratic probing to handle collisions.

LZW doesn't require collision handling because its dictionary hash requires just 32 MB memory and it is not a problem in 2019. See sparse array dictionary in lzws.

Related

Basics of MD5: How to know hash bit length and symmetry?

I'm curious about some basics of MD5 encryption I couldn't get from Google, Java questions here nor a dense law paper:
1-How to measure, in bytes, an MD5 hash string? And does it depends if the string is UNICODE or ANSI?
2-Is MD5 an assymetric algorythm?
Example: If my app talks (http) to a REST webservice using a key (MD5_128 hash string, ANSI made of 9 chars) to unencrypt received data, does that account for 9x8=72 bytes in an assymetric algorithm?
I'm using Windevs 25 in Windows, using functions like Encrypt and HashString, but I lack knowledge about encryption.
Edit: Not asnwered yet, but it seems like I need to know more about charsets before jumping to hashes and encryption. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
An MD5 hash is 128 bits, 16 bytes. The result is binary, not text, so it is neither "ANSI" nor "Unicode". Like all hashes, it is asymmetric, which should be obvious from the fact that you can hash inputs which are longer than 128 bits. Since it is asymmetric, you cannot "unencrypt" (decrypt) it. This is by design and intentional.

What is this hash algorithm?

Being a pentester, I have encountered a hash divided in two parts (the first one probably being the salt) seemingly encoded in Base64 but I am unable to find out the encryption type.
The input that gave me this hash is the string "password". Is anybody able to give me a hint ?
67Wm8zeMSS0=
s9bD0QOa7A6THDMLa39+3LmXgcxzUFdmszeZdlTUzjY=
Thanks in advance
Maybe it's SHA-256 encoded (or any other 256 bit hash algorithm), because if you base64 decode it and hex encode you get:
ebb5a6f3378c492d
b3d6c3d1039aec0e931c330b6b7f7edcb99781cc73505766b337997654d4ce36
The first has an length of 16 and the second a length of 64. That's probably not a coincidence.
Edit: Maybe it's hashed multiple times; an iterated hash. As this post says it is better to decompile the software.

How to get Perl crypt to encrypt more than 8 characters?

Only the first 8 characters is encrypted when the Perl crypt function is used. Is there a way to get it to use more characters?
As an example:
$crypted_password = crypt ("PassWord", "SALT");
and
$crypted_password = crypt ("PassWord123", "SALT");
returns exactly the same result. $crypted_password has exactly the same value.
Would love to use crypt because it is a quick and easy solution to some none reversible encryption but this limit does not make it useful for anything serious.
To quote from the documentation:
Traditionally the result is a string of 13 bytes: two first bytes of the salt, followed by 11 bytes from the set [./0-9A-Za-z], and only the first eight bytes of PLAINTEXT mattered. But alternative hashing schemes (like MD5), higher level security schemes (like C2), and implementations on non-Unix platforms may produce different strings.
So the exact return value of crypt is system dependent, but it often uses an algorithm that only looks at the first 8 byte of the password. These two things combined make it a poor choice for portable password encryption. If you're using a system with a stronger encryption routine and don't try to check those passwords on incompatible systems, you're fine. But it sounds like you're using an OS with the old crappy DES routine.
So a better option is to use a module off of CPAN that does the encryption in a predictable, more secure way.
Some searching gives a few promising looking options (That I haven't used and can't recommend one over another; I just looked for promising keywords on metacpan):
Crypt::SaltedHash
Authen::Passphrase::SaltedDigest
Crypt::Bcrypt::Easy
Crypt::Password::Util

How strong is this hashing technique?

Use AES/Rijndael or any symmetric encryption.
Encrypt the hidden value using itself as the key and a random IV.
Store the ciphertext + IV. Discard everything else.
To check the hash: try to decrypt using provided plaintext. If provided == decrypted, then it's OK.
Ignore ciphertext length problems.
Is this secure?
There is an existing method of generating a hash or MAC using an block cipher like AES. It's called CBC-MAC. It's operation is pretty simple. Just encrypt the data to be hashed using AES in CBC mode and output the last block of the ciphertext, discarding all prior blocks of the ciphertext. The IV for CBC would normally be left as zero, and the AES key can be used to produce a MAC.
CBC-MAC does have some limitations. Do not encrypt and MAC your data using the same key and IV, or the MAC will simply be equal to the last block of the ciphertext. Also, the size of the hash/MAC is limited to the size of block cipher. Using AES with CBC-MAC produces a 128 bit MAC, and MACs are usually expected to be at least this size.
Something worth noting is that CBC-MAC is a very inefficient way to produce a MAC. A better way to go would be to use SHA2-256 or SHA2-512 in HMAC. In my recent tests, using SHA256 in HMAC produces a result approximately as fast as AES in CBC-MAC, and the HMAC in this case is twice as wide. However, new CPUs will be produced with hardware acceleration for AES, allowing AES in CBC-MAC mode to be used to very quickly produce a 128 bit MAC.
As described, it has a problem in that it reveals information about the length of the data being hashed. That in itself would be some kind of weakness.
Secondly ... it is not clear that you would be able to check the hash. It would be necessary to store the randomly generated IV with the hash.
I was thinking about this while bicycling home, and one other possible issue came to mind. With a typical hashing scheme to store a password, it is best to run the hash a bunch of iterations (e.g., PBKDF2). This makes it much more expensive to run a brute force attack. One possibility to introduce that idea into your scheme might be to repeatedly loop over the encrypted data (e.g., feed back the encrypted block back into itself).

Looking for a good 64 bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoded path. the path delimiters is U+005C '\'.
The paths are null-terminated root relative windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys"
I want to hash this path into a 64-bit unsigned integer.
It does not need to be "cryptographically sound".
The hashes should be case insensitive, but able to handle non-ascii letters.
Obviously, the hash also should scatter well.
There are some ideas that I had though of:
A) Using the windows file identifier as a "hash". In my case i do want the hash to change if the file gets moved, so this is not an option.
B) Just use a regular sting hash: hash += prime * hash + codepoint for the whole string.
I do have the feeling that the fact that the path consists of "segements" (folder names and the final file name) can be leveraged.
To sum up the needs:
1) 64bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive
I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:
ui64 res = 10000019;
for(i = 0; i < len; i += 2)
{
ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
res = res * 8191 + merge; // unchecked arithmetic
}
return res;
I'm assuming that path[i + 1] is safe on the basis that if len is odd then in the last case it will read the U+0000 safely.
I wouldn't make use of the fact that there are gaps caused by the gaps in UTF-16, by lower-case and title-case characters, and by characters invalid for paths, because these are not distributed in a way to make use of this fact something that could be used speedily. Dropping by 32 (all chars below U+0032 are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing too much either.
Cryptographically secure hashes might not be very efficient in terms of speed, but there are implementations available for virtually any programming language.
Whether using them is feasible for your application depends on how much you depend on speed – a benchmark would give you an appropriate answer to that.
You could use a sub-string of such a hash, e.g. MD5 on your path, previously converted to lower case so that the hash is effectively case-insensitive (requires that you use a method for lower-casing which knows how to convert all UTF-16 non-standard characters that may occur in the file system).
Cryptographically secure hashes have the benefit of quite even distribution no matter which sub-string part you take because they are designed to be non-predictable, i.e. each part of the hash ideally depends on the entire hashed data as any other part of it.
Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, then a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 millions messages par second. You may find faster non-cryptographic hashes, but it already takes a rather specific situation for it to make a measurable difference.
For the specific properties you are after, you would need:
To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.
There are MD4 implementations available in may places, including right in the RFC 1320 I link to above. You may also find opensource MD4 implementations in C and Java in sphlib.
You could just create a shared library in C# and use the FileInfo class to obtain the full path of a directory or file. Then use .GetHashCode() in the path, like this:
Hash = fullPath.GetHashCode();
or
int getHashCode(string uri)
{
if (uri == null) throw new ArgumentNullException(nameof(uri));
FileInfo fileInfo = new FileInfo(uri);
return fileInfo.FullName.GetHashCode();
}
Altough this is just a 32 bits code, you duplicate it or append another HashCode based on some other characteristics of the file.