How to use a hash function in arduino - hash

I am developing a project with Arduino and I want to apply a hash function to the data generated by a temperature sensor.
To be more specific I want to use the SHA-1 hash.

See https://en.wikipedia.org/wiki/SHA-1 and you will notice that an 8-bit controller which stores integers in little-endian order is not the optimal platform for your idea (SHA-1 works on 32-bit big-endian words).
The available RAM (2kB) of an atmega328 should be sufficient, if you do not need too much RAM for the raw data.
So, have fun ;)
My main concern is rather the "why"?
What's wrong with a CRC checksum or similar, or possibly a private hash algorithm, to ensure data integrity?
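To make the CRC versus SHA-1 trade-off concrete, here is a minimal host-side sketch in Python (it runs on a PC, not on the Arduino itself). The record layout, the function name `pack_reading` and the sample values are assumptions made up for illustration; they are not from the question.

```python
import hashlib
import struct
import zlib

def pack_reading(sensor_id: int, centi_celsius: int) -> bytes:
    """Serialize one reading as little-endian: uint16 sensor id, int32 temperature."""
    return struct.pack("<Hi", sensor_id, centi_celsius)

payload = pack_reading(1, 2150)        # 21.50 °C in a hypothetical encoding
crc = zlib.crc32(payload)              # 32-bit checksum (4 bytes)
sha1 = hashlib.sha1(payload).digest()  # 160-bit digest (20 bytes)

print(f"payload: {payload.hex()}")
print(f"crc32:   {crc:08x}")
print(f"sha1:    {sha1.hex()}")
```

Both values change when the payload is corrupted; the CRC costs 4 bytes per record and the SHA-1 digest costs 20, which matters when the records themselves are only a few bytes long.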

Related

Hash from object's content as object ID: fast alternatives for SHA256

I'm working on design of Content-addressable storage, so I'm looking for a hash function to generate object identifiers. Every object should get short ID based on its content in that way: object_id = hash(object_content).
Prerequisites:
Hash-function should be fast.
Collision probability must be as low as possible.
Optimal ID length is 32 bytes in order to address 256^32 objects at max (but this requirement may be relaxed).
Taking these requirements into account, I picked the SHA256 hash, but unfortunately it's not fast enough for my purposes. The fastest implementations of SHA256 that I was able to benchmark were openssl and boringssl: on my desktop Intel Core I5 6400 they gave about 420 MB/s per core. Other implementations (like crypto/rsa in Go) are even slower. I would like to replace SHA256 with another hash function that provides the same collision guarantees as SHA256, but gives better throughput (at least 600 MB/s per core).
Please share your opinion about possible options to solve this problem.
Also I would like to note that hardware update (like purchasing modern CPU with AVX512 instruction set) is not possible. The main point is to find hash function that will provide better performance on commodity hardware.
Check out Cityhash and HighwayHash. Both have 256-bit variants and are much faster than SHA256. Cityhash is faster, but it is a non-cryptographic hash. HighwayHash is slower (but still faster than SHA256), and it is a secure hash.
All modern non-cryptographic hashes are much faster than SHA256. If you're willing to use a 128-bit hash, you'll have more options.
Note that you may want to consider using a 128-bit hash, as it may be adequate for your purpose. For example, if you have 10^10 different objects, the probability that you have a collision with a quality 128-bit hash is less than 10^-18. Check out the table here.
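That figure can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)), for n objects and a b-bit hash. A small sketch, assuming the hash output is uniformly distributed (the function name is mine):

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation for at least one collision among n_items."""
    exponent = -n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(exponent)

for bits in (32, 64, 128, 256):
    print(bits, collision_probability(10**10, bits))
```

For 10^10 objects the 128-bit row comes out below 10^-18, while 32 and 64 bits are clearly too short for that many objects.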
Finally, for my use-case BLAKE2S_256 turns out to be a better option than SHA256.
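For reference, a minimal sketch of object_id = hash(object_content) with BLAKE2s at 256 bits, using Python's standard hashlib. The `store` dictionary and the `put` helper are hypothetical stand-ins for the real storage layer.

```python
import hashlib

def object_id(content: bytes) -> str:
    """Return a 32-byte identifier (64 hex characters) derived from the content."""
    return hashlib.blake2s(content, digest_size=32).hexdigest()

store = {}  # hypothetical in-memory content-addressable store

def put(content: bytes) -> str:
    """Store content under its own hash; identical content maps to one entry."""
    oid = object_id(content)
    store[oid] = content
    return oid

first = put(b"hello world")
second = put(b"hello world")
print(first == second, len(store))
```

Storing the same bytes twice yields the same ID, so deduplication falls out of the scheme for free.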

Which SHA-2 function will Facebook use?

I read that Facebook on the 1st Oct 2015 will move from SHA-1 to SHA-2 and we have to update our applications: https://developers.facebook.com/blog/post/2015/06/02/SHA-2-Updates-Needed/
Do you know which function of SHA-2 will it use?
I read there are several (224, 256, 384 or 512) and one of these (SHA-224) doesn't work with the Windows XP SP3 which I use (http://blogs.msdn.com/b/alejacma/archive/2009/01/23/sha-2-support-on-windows-xp.aspx)
You don't have to care that much because usage of the SHA-224 is quite limited.
In this question CBroe has pointed out an important remark:
That blog post is about SSL connections when your app is making API
requests. This is not about anything you do with data within your app,
it is about the transport layer.
According to the https://crypto.stackexchange.com/questions/15151/sha-224-purpose
Answer by Ilmari Karonen:
Honestly, in practice, there are very few if any reasons to use
SHA-224.
As fgrieu notes, SHA-224 is simply SHA-256 with a different IV and
with 32 of the output bits thrown away. For most purposes, if you want
a hash with more than 128 but less than 256 bits, simply using SHA-256
and truncating the output yourself to the desired bit length is
simpler and just as efficient as using SHA-224. As you observe,
SHA-256 is also more likely to be available on different platforms
than SHA-224, making it the better choice for portability.
Why would you ever want to use SHA-224, then?
The obvious use case is if you need to implement an existing protocol
that specifies the use of SHA-224 hashes. While, for the reasons
described above, it's not a very common choice, I'm sure such
protocols do exist.
Also, a minor advantage of SHA-224 over truncated SHA-256 is that, due
to the different IV, knowing the SHA-224 hash of a given message does
not reveal anything useful about its SHA-256 hash, or vice versa. This
is really more of an "idiot-proofing" feature; since the two hashes
have different names, careless users might assume that their outputs
have nothing in common, so NIST changed the IV to ensure that this is
indeed the case.
However, this isn't really something you should generally rely on. If
you really need to compute multiple unrelated hashes of the same input
string, what you probably want instead is a keyed PRF like HMAC, which
can be instantiated using any common hash function (such as SHA-256).
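The two points from that answer are easy to see with Python's hashlib and hmac: SHA-224 is not the same as SHA-256 cut down to 224 bits, and HMAC with distinct keys gives unrelated digests of the same input. The message and the key labels below are arbitrary examples.

```python
import hashlib
import hmac

msg = b"example message"

sha224 = hashlib.sha224(msg).digest()
truncated = hashlib.sha256(msg).digest()[:28]  # first 224 bits of SHA-256
print(len(sha224), len(truncated), sha224 != truncated)

# Two unrelated digests of the same input: HMAC-SHA-256 under different keys
tag_a = hmac.new(b"purpose-a", msg, hashlib.sha256).digest()
tag_b = hmac.new(b"purpose-b", msg, hashlib.sha256).digest()
print(tag_a != tag_b)
```

Both digests in the first comparison are 28 bytes long, but they differ because of the different initial values.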
As you've mentioned, Windows XP with SP3 doesn't support SHA-224 but it supports SHA-256:
Check also: https://security.stackexchange.com/questions/1751/what-are-the-realistic-and-most-secure-crypto-for-symmetric-asymmetric-hash
Especially: https://stackoverflow.com/a/817121/3964066
And: https://security.stackexchange.com/a/1755
Part of the Thomas Pornin's answer:
ECDSA over a 256-bit curve already achieves an "unbreakable" level of
security (i.e. roughly the same level than AES with a 128-bit key, or
SHA-256 against collisions). Note that there are elliptic curves on
prime fields, and curves on binary fields; which kind is most
efficient depends on the involved hardware (for curves of similar
size, a PC will prefer the curves on a prime field, but dedicated
hardware will be easier to build with binary fields; the CLMUL
instructions on the newer Intel and AMD processors may change that).
SHA-512 uses 64-bit operations. This is fast on a PC, not so fast on a
smartcard. SHA-256 is often a better deal on small hardware (including
32-bit architectures such as home routers or smartphones)

CRC32+Size vs MD5/SHA1

We have a storage of files and the storage uniquely identifies a file on the basis of size appended to crc32.
I wanted to know if this checksum ( crc32 + size ) would be good enough for identifying files or should we consider some other hashing technique like MD5/SHA1?
CRC is more an error detection method than a serious hash function. It helps to identify corrupted files rather than to identify files uniquely.
So your choice should be between MD5 and SHA1.
If you don't have strong security needs you can choose MD5, which should be faster
(remember that MD5 is vulnerable to collision attacks).
If you need more security, you had better use SHA1 or even SHA2.
CRC-32 is not good enough; it is trivial to build collisions, i.e. two files (of the same length if you wish it so) which have the same CRC-32. Even in the absence of a malicious attacker, collisions will happen randomly once you have about 65000 distinct files with the same length.
A hash function is designed to avoid collisions. With MD5 or SHA-1, you will not get random collisions. If your setup is security-related (i.e. there is someone, somewhere, who may actively try to create collisions), then you need a secure hash function. MD5 is not secure anymore (creating collisions with MD5 is easy) and SHA-1 is somewhat weak in that respect (no actual collisions were computed, but a method for creating one is known and, while expensive, it is much less expensive than what it ought to be). The usual recommendation is to use SHA-256 or SHA-512 (SHA-256 is enough for security; SHA-512 may be a tad faster on big, 64-bit systems, but file reading bandwidth will be more limiting than hashing speed).
Note: when using a cryptographic hash function, there is no need to store and compare the file lengths; the hash is sufficient to disambiguate files.
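As a sketch of what identifying a file by its SHA-256 looks like in practice, the file can be hashed in chunks so that memory use stays constant. The function name and the 1 MiB chunk size are my own choices.

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file incrementally and return the hex digest used as its identifier."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files get the same identifier exactly when their contents hash to the same value, regardless of name or length bookkeeping.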
In a non-security setup (i.e. you only fear random collisions), then MD4 can be used. It is thoroughly "broken" as a cryptographic hash function, but it still is a very good checksum, and it is really fast (on some ARM-based platforms, it is even faster than CRC-32, for a much better resistance to random collisions). Basically, you should not use MD5: if you have security issues, then MD5 must not be used (it is broken; use SHA-256); and if you do not have security issues then MD4 is faster than MD5.
The space that would be used by a CRC32+size gives you enough room for a bigger CRC, which would be a much better choice. That is, as long as you are not worried about malicious collisions; if you are, Thomas' answer applies.
You didn't specify a language, but in C++ for example you have Boost CRC, which gives you a CRC of whatever size you want (or can afford to store).
As others have said, CRC doesn't guarantee absence of collisions. However, your problem can be solved simply by giving the files incrementing 64-bit numbers. This is guaranteed to never collide (unless you want to keep a gazillion files in one directory, which is not a good idea anyway).

How safe is it to rely on hashes for file identification?

I am designing a storage cloud software on top of a LAMP stack.
Files could have an internal ID, but it would have many advantages to store them not with an incrementing ID as the filename in the servers' filesystems, but using a hash as the filename.
Also hashes as identifier in the database would have a lot of advantages if the currently centralized database should be sharded or decentralized or some sort of master-master high availability environment should be set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because on the first level there is something like "buckets" that users have to register, like in Amazon S3 and Google storage.
My plan is to store files as hash of the client side defined path.
This way the storage server can directly serve the file without needing the database to ask which ID it is because it can calculate the hash and thus the filename on the fly.
But I am afraid of collisions. I currently think about using SHA1 hashes.
I heard that Git uses hashes as revision identifiers as well.
I know that the chances of collisions are really really low, but possible.
I just cannot judge this. Would you or would you not rely on hash for this purpose?
I could also use some normalization or encoding of the path, maybe base64, as the filename, but I really do not want that because it could get messy, paths could get too long, and there may be other complications.
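A minimal sketch of the plan described above, assuming SHA-1 over the UTF-8 encoded "bucket/path" string. The Unicode NFC normalization step and the two-level directory fan-out are my assumptions, not part of the question.

```python
import hashlib
import unicodedata

def storage_name(bucket: str, client_path: str) -> str:
    """Map a client-defined path to an on-disk name that needs no database lookup."""
    # Normalize so that visually identical paths hash to the same name
    normalized = unicodedata.normalize("NFC", f"{bucket}/{client_path}")
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    # Fan out over two directory levels: ab/cd/abcd...
    return f"{digest[:2]}/{digest[2:4]}/{digest}"

print(storage_name("my-bucket", "photos/2015/summer.jpg"))
```

The resulting name has a fixed length however long the client path is, which avoids the path-length problem that base64 would have.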
Assuming you have a hash function with "perfect" properties, and assuming cryptographic hash functions approach that ideal, the theory that applies is the same one that applies to birthday attacks. What this says is that, given a maximum number of files, you can make the collision probability as small as you want by using a larger hash digest size. SHA-1 has 160 bits, so for any practical number of files the probability of collision is going to be just about zero. If you look at the table in the link you'll see that a 128-bit hash with 10^10 files has a collision probability of 10^-18.
As long as the probability is low enough I think the solution is good. Compare with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory etc. - as long as those probabilities are low enough we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of the hash function and its possible vulnerabilities. Is there any other authentication in place or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute-force the scenario above, they would need to request on the order of 2^128 / 10^10 ≈ 3 × 10^28 paths before they hit some other random file stored in the system (again assuming a 128-bit hash and 10^10 files; you'll have far fewer files and a longer hash). That is a very big number, and the speed at which you can brute-force this is limited by the network and the server. A simple "lock the user out after x attempts" policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies on certain events having very low probability for its security. e.g. I can be lucky and guess the prime factor on a 512 bit RSA key but it is so unlikely that the system is considered very secure.
Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.
= end of business
I'd rather use hashing for things that were less critical when collisions DO occur ;-)
If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
Imagine the worst case scenario and assume it will happen the week after you are featured in wired magazine as an amazing startup ... that's a good stress test for the algorithm.

is perfect hashing without buckets possible?

I've been asked to look for a perfect hash/one way function to be able to hash 10^11 numbers.
However, as we'll be using an embedded device, it won't have the memory to store the relevant buckets, so I was wondering if it's possible to have a decent (minimal) perfect hash without them?
The plan is to use the device to hash the number(s) and we use a rainbow table or a file using the hash as the offset.
Cheers
Edit:
I'll try to provide some more info :)
1) 10^11 is actually now 10^10, so that makes it easier. This number is the number of possible combinations, so we could get a number between 0000000001 and 10000000000 (10^10).
2) The plan is to use it as part of a one-way function to make the number secure so we can send it by insecure means.
We will then look up the original number at the other end using a rainbow table.
The problem is that the source devices generally have 512k-4Meg of memory to use.
3) it must be perfect - we 100% cannot have a collision .
Edit2:
4) We can't use encryption as we've been told it's not really possible on the devices, and key management would be a nightmare if we could.
Edit3:
As this is not sensible, it's a purely academic question now (I promise)
Okay, since you've clarified what you're trying to do, I rewrote my answer.
To summarize: Use a real encryption algorithm.
First, let me go over why your hashing system is a bad idea.
What is your hashing system, anyway?
As I understand it, your proposed system is something like this:
Your embedded system (which I will call C) is sending some sort of data with a value space of 10^11. This data needs to be kept confidential in transit to some sort of server (which I will call S).
Your proposal is to send the value hash(salt + data) to S. S will then use a rainbow table to reverse this hash and recover the data. salt is a shared value known to both C and S.
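A scaled-down sketch of that scheme, to make the rest of the discussion concrete. It uses a full lookup table rather than a true rainbow table, a value space of 10^4 instead of 10^10, and SHA-1 as the hash; the salt value and all names are hypothetical.

```python
import hashlib

SALT = b"shared-secret"  # hypothetical salt, known to both C and S
VALUE_SPACE = 10_000     # scaled down from 10^10 for the demo

def encode(value: int, salt: bytes = SALT) -> bytes:
    """What C sends: hash(salt + data)."""
    return hashlib.sha1(salt + str(value).encode()).digest()

# What S must precompute once per salt: every possible hash, mapped back to its input
reverse_table = {encode(v): v for v in range(VALUE_SPACE)}

def decode(token: bytes) -> int:
    """What S does on receipt: look the hash up in the table."""
    return reverse_table[token]

print(decode(encode(1234)))
```

Anyone who learns SALT can build the same table, and changing SALT means rebuilding it from scratch.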
This is an encryption algorithm
An encryption algorithm, when you boil it down, is any algorithm that gives you confidentiality. Since your goal is confidentiality, any algorithm that satisfies your goals is an encryption algorithm, including this one.
This is a very poor encryption algorithm
First, there is an unavoidable chance of collision. Moreover, the set of colliding values differs each day.
Second, decryption is extremely CPU- and memory-intensive even for the legitimate server S. Changing the salt is even more expensive.
Third, although your stated goal is avoiding key management, your salt is a key! You haven't solved key management at all; anyone with the salt will be able to crack the message just as well as you can.
Fourth, it's only usable from C to S. Your embedded system C will not have enough computational resources to reverse hashes, and can only send data.
This isn't any faster than a real encryption algorithm on the embedded device
Most secure hashing algorithms are just as computationally expensive as a reasonable block cipher, if not worse. For example, SHA-1 requires doing the following for each 512-bit block:
Allocate 12 32-bit variables.
Allocate 80 32-bit words for the expanded message
64 times: Perform four array lookups, three 32-bit xors, and a rotate operation
80 times: Perform up to five 32-bit binary operations (some combination of xor, and, or, and not, depending on the round); then a rotate, array lookup, four adds, another rotate, and several memory loads/stores.
Perform five 32-bit twos-complement adds
There is one chunk per 512 bits of the message, plus a possible extra chunk at the end. This is 1136 binary operations per chunk (not counting memory operations or the five final adds), or about 18 operations per byte.
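The tally can be reproduced from the step counts listed above. This is only arithmetic on those counts (memory operations and the five final adds are left out, which is what gives the 1136 total):

```python
expansion_rounds = 64
expansion_ops = 3 + 1            # three xors and one rotate per expanded word
compression_rounds = 80
compression_ops = 5 + 1 + 4 + 1  # up to five logic ops, a rotate, four adds, a rotate

ops_per_chunk = (expansion_rounds * expansion_ops
                 + compression_rounds * compression_ops)
chunk_bytes = 512 // 8

print(ops_per_chunk)                # 1136
print(ops_per_chunk / chunk_bytes)  # 17.75
```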
For comparison, the RC4 encryption algorithm requires four operations (three additions, plus an xor on the message) per byte, plus two array reads and two array writes. It also requires only 258 bytes of working memory, vs a peak of 368 bytes for SHA-1.
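To show how little work RC4 does per byte, here is a plain sketch of the algorithm (key scheduling followed by the per-byte generator). It is for illustration only, not a vetted implementation; the `drop` parameter for discarding initial keystream bytes is my addition.

```python
def rc4(key: bytes, data: bytes, drop: int = 0) -> bytes:
    """Encrypt or decrypt data with RC4, optionally discarding `drop` keystream bytes."""
    # Key scheduling: permute the 256-byte state using the key
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]

    # Keystream generation: three additions, a swap, and one xor per output byte
    i = j = 0
    out = bytearray()
    for n in range(drop + len(data)):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        k = s[(s[i] + s[j]) % 256]
        if n >= drop:
            out.append(data[n - drop] ^ k)
    return bytes(out)

ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4(b"Key", ciphertext))
```

Encryption and decryption are the same operation, since both just xor the data with the keystream.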
Key management is fundamental
With any confidentiality system, you must have some sort of secret. If you have no secrets, then anyone else can implement the same decoding algorithm, and your data is exposed to the world.
So, you have two choices as to where to put the secrecy. One option is to make the encipherment/decipherment algorithms secret. However, if the code (or binaries) for the algorithm is ever leaked, you lose - it's quite hard to replace such an algorithm.
Thus, secrets are generally made easy to replace - this is what we call a key.
Your proposed usage of hash algorithms would require a salt - this is the only secret in the system and is therefore a key. Whether you like it or not, you will have to manage this key carefully. And it's a lot harder to replace if leaked than other keys - you have to spend many CPU-hours generating a new rainbow table every time it's changed!
What should you do?
Use a real encryption algorithm, and spend some time actually thinking about key management. These issues have been solved before.
First, use a real encryption algorithm. AES has been designed for high performance and low RAM requirements. You could also use a stream cipher like RC4 as I mentioned before - the thing to watch out for with RC4, however, is that you must discard the first 4 kilobytes or so of output from the cipher, or you will be vulnerable to the same attacks that plagued WEP.
Second, think about key management. One option is to simply burn a key into each client, and physically go out and replace it if the client is compromised. This is reasonable if you have easy physical access to all of the clients.
Otherwise, if you don't care about man-in-the-middle attacks, you can simply use Diffie-Hellman key exchange to negotiate a shared key between S and C. If you are concerned about MitMs, then you'll need to start looking at ECDSA or something to authenticate the key obtained from the D-H exchange - beware that when you start going down that road, it's easy to get things wrong, however. I would recommend implementing TLS at that point. It's not beyond the capabilities of an embedded system - indeed, there are a number of embedded commercial (and open source) libraries available already. If you don't implement TLS, then at least have a professional cryptographer look over your algorithm before implementing it.
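For a feel of what the unauthenticated Diffie-Hellman step involves, here is a toy sketch using modular exponentiation. The prime and generator are illustrative assumptions and far too weak for real use; a real deployment would use a vetted library and standardized parameters, as recommended above.

```python
import secrets

P = 2**127 - 1  # a Mersenne prime, chosen only to keep the demo small; NOT secure
G = 3           # toy generator (assumption)

def keypair():
    """Pick a random private exponent and derive the public value G^private mod P."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public

c_priv, c_pub = keypair()  # embedded client C
s_priv, s_pub = keypair()  # server S

# Each side combines its own private value with the other's public value
c_shared = pow(s_pub, c_priv, P)
s_shared = pow(c_pub, s_priv, P)
print(c_shared == s_shared)
```

Only the public values cross the wire, but nothing here tells C that s_pub really came from S, which is the man-in-the-middle problem mentioned above.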
There is obviously no such thing as a "perfect" hash unless you have at least as many hash buckets as inputs; if you don't, then inevitably it will be possible for two of your inputs to share the same hash bucket.
However, it's unlikely you'll be storing all the numbers between 0 and 10^11. So what's the pattern? If there's a pattern, there may be a perfect hash function for your actual data set.
It's really not that important to find a "perfect" hash function anyway, though. Hash tables are very fast. A function with a very low collision rate - and when hashing integers, that means nearly any simple function, like modulus - is fine and you'll get O(1) average performance.
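If all that is wanted is a collision-free mapping of integers (with as many "buckets" as inputs), any bijection will do. One sketch: multiplying by an odd constant modulo 2^34 permutes all 34-bit values, which covers numbers up to 10^10. The constant is an arbitrary assumption, and this is trivially reversible, so it is in no sense one-way or secure.

```python
BITS = 34                # 2**34 > 10**10, so every input fits
MOD = 1 << BITS
A = 0x9E3779B1           # arbitrary odd constant (assumption)
A_INV = pow(A, -1, MOD)  # modular inverse; needs Python 3.8+

def scramble(n: int) -> int:
    """Bijective 'hash' on [0, 2**34): distinct inputs give distinct outputs."""
    return (n * A) % MOD

def unscramble(h: int) -> int:
    """Invert scramble() by multiplying with the modular inverse."""
    return (h * A_INV) % MOD

print(unscramble(scramble(10**10)) == 10**10)
```

Because an odd multiplier is invertible modulo a power of two, no two inputs can ever share an output, and no bucket table is needed on the device.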