Why should AES-GCM with a random nonce be rotated every 200k writes in Kubernetes?

We are implementing encryption at rest in Kubernetes following this tutorial (https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/), and we are absolutely not sure why the AES-GCM encryption provider requires the key to be rotated every 200K writes, because we lack knowledge about how the encryption works. Also, what exactly does "200K writes" mean, and how can we decide when we should rotate the key? Thank you

we are absolutely not sure why the AES-GCM encryption provider requires the key to be rotated
GCM mode is basically CTR mode (a stream construction) with built-in integrity protection (a message authentication code). For this mode it is critical to never reuse the same IV/key pair. It is therefore advised to limit the amount of content encrypted with the same key, which keeps the probability of a nonce collision low and limits the material available for key analysis (there is some math behind this, already referenced in the comments).
Yes, 200k is a somewhat arbitrary number, but someone has to pick a reasonable limit at which the nonce-collision probability is still negligible while the key remains usable for a significant time.
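As a back-of-the-envelope illustration (my own sketch, not from the Kubernetes docs), the birthday bound for random 96-bit GCM nonces can be estimated like this:

    # Rough birthday-bound estimate for random 96-bit AES-GCM nonces:
    # p ~= n^2 / 2^(b+1) for n encryptions under one key and a b-bit nonce.

    def nonce_collision_probability(n_writes: int, nonce_bits: int = 96) -> float:
        return n_writes ** 2 / 2 ** (nonce_bits + 1)

    for n in (200_000, 2 ** 32):
        print(f"{n:>12} writes -> collision probability ~ {nonce_collision_probability(n):.1e}")

    # 200,000 writes -> ~2.5e-19 (negligible)
    # 2^32 writes    -> ~1.2e-10 (the commonly cited upper bound for random GCM nonces)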
what exactly does "200K writes" mean
This is usually very hard to estimate, because it depends on what a "write" is. It can differ if you use the key to encrypt other random keys (as a wrapping key) versus using the key to encrypt a lot of continuous content (e.g. a storage volume).
how can we decide when we should rotate the key?
Let's be practical: AWS KMS, for example, provides automatic key rotation once a year. Based on the question, and assuming the key is used to encrypt the etcd storage (cluster configuration), a yearly rotation can be a safe option. (I expect you don't have 200k Secrets and ConfigMaps in the k8s cluster.)
The key rotation process usually creates a new key (key version), and new content is encrypted using the new key. Existing content can still be decrypted using the older keys.
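As a conceptual sketch only (my own illustration using the Python cryptography package, not how the kube-apiserver provider is actually implemented), a versioned keyring works roughly like this:

    # Sketch of a versioned keyring: encrypt with the newest key, keep old keys for decryption.
    # Key names and the stored (key_id, nonce, ciphertext) format are made up for illustration.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    keyring = {
        "key2": AESGCM.generate_key(bit_length=256),  # current key, used for all new writes
        "key1": AESGCM.generate_key(bit_length=256),  # old key, kept so existing data still decrypts
    }
    CURRENT = "key2"

    def encrypt(plaintext: bytes):
        nonce = os.urandom(12)  # fresh random 96-bit nonce per write
        ciphertext = AESGCM(keyring[CURRENT]).encrypt(nonce, plaintext, None)
        return CURRENT, nonce, ciphertext  # store the key id alongside the ciphertext

    def decrypt(key_id: str, nonce: bytes, ciphertext: bytes) -> bytes:
        return AESGCM(keyring[key_id]).decrypt(nonce, ciphertext, None)

    key_id, nonce, ciphertext = encrypt(b"some secret value")
    assert decrypt(key_id, nonce, ciphertext) == b"some secret value"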
In this regard I have a small concern about how the key rotation is described in the documentation. Basically, steps 1-4 look OK: a new encryption key is defined and put into force. Steps 5 and 6 then re-encrypt all the existing etcd content using the new key, which basically limits (if not defeats) the whole purpose of the key rotation. Maybe you could take that up with support if you have the time and patience to dig in.

As per OpenShift docs:
https://docs.openshift.com/container-platform/3.11/admin_guide/encrypting_data.html
Kubernetes has no proper nonce generator and uses a random IV as nonce for AES-GCM. Since AES-GCM requires a proper nonce to be secure, AES-GCM is not recommended. The 200,000 write limit just limits the possibility of a fatal nonce misuse to a reasonable low margin.
It's just an arbitrary number from what I can see.
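To make the "fatal nonce misuse" above concrete, here is a small demonstration (my own sketch using the Python cryptography package): reusing a nonce under the same key leaks the XOR of the two plaintexts, because GCM's underlying CTR keystream repeats.

    # Demonstration: nonce reuse in AES-GCM leaks the XOR of the plaintexts.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key, nonce = AESGCM.generate_key(bit_length=256), os.urandom(12)
    p1, p2 = b"attack at dawn!!", b"retreat at noon!"

    c1 = AESGCM(key).encrypt(nonce, p1, None)  # same nonce used twice: never do this
    c2 = AESGCM(key).encrypt(nonce, p2, None)

    leaked = bytes(a ^ b for a, b in zip(c1, c2))  # the keystream cancels out
    assert leaked[:len(p1)] == bytes(a ^ b for a, b in zip(p1, p2))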

Related

What hash algorithm is appropriate for reducing key size

I have the need to store key-value pairs, where the key is supposed to be unique. It will be kept both in memory and in binary format on disk. The key is also part of a custom message protocol sent over TCP.
I was first thinking of not supporting whatever key size you want, but instead limiting it to X chars.
Is there an applicable hashing algorithm (no need for security) that could be used instead to reduce the length of the key while still being good enough for uniqueness?
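One possible approach (just a sketch of the idea, not an accepted answer): derive a fixed-size key by truncating a cryptographic digest of the original key, trading a little collision risk for a hard upper bound on length.

    # Sketch: map arbitrarily long keys to a fixed-size identifier via a truncated digest.
    # 16 bytes (128 bits) keeps the birthday-collision risk negligible for most data sets.
    import hashlib

    def short_key(original_key: str, size: int = 16) -> bytes:
        return hashlib.sha256(original_key.encode("utf-8")).digest()[:size]

    print(short_key("some/very/long/protocol/key/name").hex())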

Key-Value Store: Overcoming key length limit

How do I enforce a unique constraint in Key-Value store where the unique data is longer than the key length limit?
I currently use Couchbase to store the document below:
    {
        url: "http://google.com",
        siteName: "google.com",
        data:
        {
            //more properties
        }
    }
The unique constraint is defined over url + siteName. However, I can't use those properties as the key, since their combined length can exceed Couchbase's key length limit.
I currently have two solutions in mind, but I think that neither is good enough.
Solution 1
Document key is the SHA1 hash of url + siteName.
Advantages: easy to implement
Disadvantages: collisions can occur
Solution 2
Document key is the hash(url + siteName) + index.
This is the same as Solution 1, but the key includes an index in case a collision occurs.
To add a document, the application server:
1. Sets index to 0
2. Stores the document with the key = hash(url + siteName) + index
3. If a duplicate-key conflict occurs, reads the existing document back
4. Checks whether the existing document has the same url and siteName as the one being stored:
If yes, throws an exception if duplicates aren't allowed
If no, increments index and goes back to step 2
This is currently my favorite solution because it can handle collisions.
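A rough sketch of that insert loop (the kv client and its insert/get calls below are hypothetical stand-ins, not the actual Couchbase SDK):

    # Sketch of Solution 2: key = hash(url + siteName) + index, probing on duplicate-key conflicts.
    import hashlib

    class DuplicateKeyError(Exception):
        pass

    def make_key(url: str, site_name: str, index: int) -> str:
        digest = hashlib.sha1((url + site_name).encode("utf-8")).hexdigest()
        return f"{digest}:{index}"

    def add_document(kv, url: str, site_name: str, doc: dict) -> str:
        index = 0
        while True:
            key = make_key(url, site_name, index)
            try:
                kv.insert(key, doc)  # assumed "add only if absent" operation
                return key
            except DuplicateKeyError:
                existing = kv.get(key)
                if existing["url"] == url and existing["siteName"] == site_name:
                    raise ValueError("document already exists")  # duplicates aren't allowed
                index += 1  # genuine hash collision: try the next slot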
I am a NoSQL n00b! How can I enforce unique constraints in a Key-Value store?
After reading your question, here are my thoughts/opinions, which I think should help give rationale for choosing your first option.
Couchbase is an in-memory cache/dictionary. To store many (read "very large incomprehensible number") values, it requires both RAM and disk space. Regardless of how much space each document occupies, all of the document keys are stored in RAM. If you were therefore permitted to store an arbitrarily large value for the key, your server farm would consume RAM faster than you could supply it, and your design would fall apart.
With item #1 being the case, your application needs to be designed such that key sizes are as small as practicable. Dictionary key/hash value computation is left to the application API (in the same way that it is left to the .NET or Java APIs, which likewise compute hashes on string inputs). The same method of producing a hash should be used regardless of the input, for the sake of consistency.
The SHA-1 hash has an extremely low collision probability, and it is designed that way to make "breaking" it computationally infeasible. This is the foundation behind the "fingerprint" in Bitcoin. See here and here for tasty reading on the topic.
Given what I know about hashes, and given the fact that URLs always start with the same set of characters, this theoretically lowers the likelihood of collision even further.
If you are, in fact, storing enough documents that the odds of a SHA1 collision are significant, then there are almost certainly at least a dozen other issues that will affect your application's usability and reliability in a more significant way, and you should devote your energy to thinking about those things.
The hard part about being an engineer is recognizing the need to take a step back from the engineering and say when "good" is "good enough." That being said, option 1 looks like the best choice: it's simple and consistent. If properly applied, that's all you need. Check the box on this one and move on to your next issue.
I'd go for Solution 1, but for choosing the hashing function you should consider the following things:
How much data do you have? That determines how large the generated hash should be to keep the probability of collisions at a minimum. Here the best might be SHA-512, which has a 512-bit output, compared to the 160 bits of SHA-1.
What performance do you need from the hashing function? The SHA family is fairly slow compared to MD5, and depending on the number of items you want to store, MD5 could be good enough as well.
In the end you can also use a combination: use siteName + url as the key if it is short enough, switch to siteName + hash(url) if that combination fits, and only hash both together as a last resort (see the sketch at the end of this answer).
On a related note, I also found this question http://www.couchbase.com/communities/q-and-a/key-size-limits-couchbasemembase-again where one answer suggests compressing the keys if that is possible for you.
You could actually use normal gzip compression and encode the text. I'm not sure how well this would work in your use case, you'll have to check it, but I used it for JSON files and managed to reduce them down to ~20%. However, that was a huge 8 MB file, so the compression gains for your key might be much lower.
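A small sketch of that fallback combination (my own illustration; MAX_KEY_LEN is an assumed limit for the example, check your store's actual one):

    # Sketch: prefer the readable key, fall back to hashing only the part that is too long.
    import hashlib

    MAX_KEY_LEN = 250  # assumed limit for illustration

    def sha1_hex(s: str) -> str:
        return hashlib.sha1(s.encode("utf-8")).hexdigest()

    def make_key(site_name: str, url: str) -> str:
        key = f"{site_name}::{url}"
        if len(key) <= MAX_KEY_LEN:
            return key  # short enough: keep it human-readable
        key = f"{site_name}::{sha1_hex(url)}"
        if len(key) <= MAX_KEY_LEN:
            return key  # hash only the long part
        return sha1_hex(site_name + url)  # last resort: hash everything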

Strong one way hash for weak passwords

I am searching for a strong one-way hashing function which hashes (really) weak passwords (about 10^9 combinations). The function must also fulfill some requirements:
Always the same hash from the same clear text. So scrypt/bcrypt and public/private key methods are not possible (or am I wrong here?).
No shared secret as in AES, because the same hashes have to be produced by different clients.
No salts
So what could be done to increase the difficulty of brute-forcing against such a small space of possible values? I already tried key stretching with multiple rounds of SHA-256, but I am not sure how many rounds are required to significantly increase the computation time (must be in the billions, I guess).
The only thing I have come up with so far is using a server-side secret which gets added to the password. But in case of a compromise it is difficult to guarantee that the secret is still a secret...
I would be glad for some hints or ideas!
regards,
r0cks
This is not possible using just an algorithm. You are better off splitting your system into parts with specific roles to protect your data.
scrypt, bcrypt and PBKDF2 are all deterministic password-based key derivation functions (PBKDFs). As long as the salt stays the same, they will reproduce the same result for the same input. (Parts of) the salt may be a server-side secret. If the salt gets exposed, those functions will however not help much if the password is weak.
It is not easy to give a good answer without more context.
A server-side secret can only protect the IDs(?) if they are stored server-side (in a database). Even then it only protects the IDs as long as the secret is not known (the problem of every two-way encryption scheme).
A one-way hash, on the other hand, would protect the IDs even when the code and the database are stolen. A bcrypt/PBKDF2 hash would slow down brute-forcing even with a static salt, though a static salt would allow building a single rainbow table to crack all hashes at once.
Using bcrypt and afterwards encrypting the hash with a server-side secret is possibly your best bet, though it is difficult to say without knowing more about the scenario.
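As a hedged sketch of that direction (my own illustration; PBKDF2 with a fixed application salt stands in for the deterministic slow hash, an HMAC with a server-side "pepper" stands in for "encrypt with a server-side secret", and the iteration count is only a starting point):

    # Sketch: deterministic slow hash (PBKDF2 with a static application salt),
    # then keyed with a server-side secret so a leaked database alone is not enough.
    import hashlib
    import hmac

    APP_SALT = b"fixed-application-salt"        # static by requirement; a leak enables one rainbow table
    SERVER_SECRET = b"keep-this-out-of-the-db"  # the "pepper"; must live outside the database

    def protect(password: str, iterations: int = 600_000) -> bytes:
        stretched = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), APP_SALT, iterations)
        return hmac.new(SERVER_SECRET, stretched, hashlib.sha256).digest()

    # Deterministic: the same input always yields the same value.
    assert protect("1234567890") == protect("1234567890")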

How safe is it to rely on hashes for file identification?

I am designing a storage cloud software on top of a LAMP stack.
Files could have an internal ID, but it would have many advantages to store them in the server's filesystem not under an incrementing ID as the filename, but under a hash as the filename.
Also, hashes as identifiers in the database would have a lot of advantages if the currently centralized database should later be sharded or decentralized, or if some sort of master-master high-availability environment should be set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because the first level is something like "buckets" that users have to register, as in Amazon S3 and Google Storage.
My plan is to store each file under the hash of the client-side-defined path.
This way the storage server can serve the file directly, without asking the database which ID it is, because it can compute the hash, and thus the filename, on the fly.
But I am afraid of collisions. I am currently thinking about using SHA-1 hashes.
I heard that Git also uses hashes as revision identifiers.
I know that the chances of collisions are really, really low, but possible.
I just cannot judge this. Would you or would you not rely on hashes for this purpose?
I could also use some normalization or encoding of the path, maybe Base64 as the filename, but I really do not want that because it could get messy, paths could get too long, and there could be other complications.
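For what it's worth, a minimal sketch of the idea (my own illustration; the directory fan-out and the choice of SHA-256 over SHA-1 are assumptions, not requirements):

    # Sketch: derive the on-disk filename from the client-supplied path, no database lookup needed.
    import hashlib
    from pathlib import Path

    STORAGE_ROOT = Path("/var/storage")  # assumed location

    def blob_path(bucket: str, client_path: str) -> Path:
        digest = hashlib.sha256(f"{bucket}/{client_path}".encode("utf-8")).hexdigest()
        # fan out into subdirectories so no single directory grows too large
        return STORAGE_ROOT / digest[:2] / digest[2:4] / digest

    print(blob_path("my-bucket", "photos/2014/cat.jpg"))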
Assuming you have a hash function with "perfect" properties (and cryptographic hash functions come close), the theory that applies is the same as for birthday attacks. What this says is that, for a given maximum number of files, you can make the collision probability as small as you want by using a larger hash digest. SHA-1 has 160 bits, so for any practical number of files the probability of a collision is going to be just about zero. If you look at the table in the link, you'll see that a 128-bit hash with 10^10 files has a collision probability of about 10^-18.
As long as the probability is low enough, I think the solution is good. Compare it with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory, etc. As long as those probabilities are low enough, we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of the hash function and its possible vulnerabilities. Is there any other authentication in place, or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute force the scenario above, they would need to request on the order of 2^95 random hashes before they hit some other file stored in the system (2^128 possible hashes divided by the 10^10 stored files; again, you'll have far fewer files and a longer hash). That is an astronomically large number, and the speed at which it can be brute forced is limited by the network and the server. A simple lock-the-user-out-after-x-attempts policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies on certain events having a very low probability for its security. E.g. I could be lucky and guess a prime factor of a 512-bit RSA key, but it is so unlikely that the system is considered very secure.
Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.
= end of business
I'd rather use hashing for things that were less critical when collisions DO occur ;-)
If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
Imagine the worst-case scenario and assume it will happen the week after you are featured in Wired magazine as an amazing startup... that's a good stress test for the algorithm.

is perfect hashing without buckets possible?

I've been asked to look for a perfect hash/one way function to be able to hash 10^11 numbers.
However, as we'll be using an embedded device, it won't have the memory to store the relevant buckets, so I was wondering if it's possible to have a decent (minimal) perfect hash without them?
The plan is to use the device to hash the number(s), and then use a rainbow table, or a file using the hash as the offset, to look the number up again.
Cheers
Edit:
I'll try to provide some more info :)
1) 10^11 is actually now 10^10, so that makes it easier. This number is the number of possible combinations, so we could get a number between 0000000001 and 10000000000 (10^10).
2) The plan is to use it as part of a one-way function to make the number secure, so we can send it over an insecure channel.
We will then look up the original number at the other end using a rainbow table.
The problem is that the devices generally have 512 KB-4 MB of memory to use.
3) It must be perfect - we 100% cannot have a collision.
Edit2:
4) We can't use encryption, as we've been told it's not really possible on the devices, and key management would be a nightmare if we could.
Edit3:
As this is not sensible, it's purely an academic question now (I promise)
Okay, since you've clarified what you're trying to do, I rewrote my answer.
To summarize: Use a real encryption algorithm.
First, let me go over why your hashing system is a bad idea.
What is your hashing system, anyway?
As I understand it, your proposed system is something like this:
Your embedded system (which I will call C) is sending some sort of data with a value space of 10^11. This data needs to be kept confidential in transit to some sort of server (which I will call S).
Your proposal is to send the value hash(salt + data) to S. S will then use a rainbow table to reverse this hash and recover the data. salt is a shared value known to both C and S.
This is an encryption algorithm
An encryption algorithm, when you boil it down, is any algorithm that gives you confidentiality. Since your goal is confidentiality, any algorithm that satisfies your goals is an encryption algorithm, including this one.
This is a very poor encryption algorithm
First, there is an unavoidable chance of collision. Moreover, the set of colliding values differs each day.
Second, decryption is extremely CPU- and memory-intensive even for the legitimate server S. Changing the salt is even more expensive.
Third, although your stated goal is avoiding key management, your salt is a key! You haven't solved key management at all; anyone with the salt will be able to crack the message just as well as you can.
Fourth, it's only usable from C to S. Your embedded system C will not have enough computational resources to reverse hashes, and can only send data.
This isn't any faster than a real encryption algorithm on the embedded device
Most secure hashing algorithms are just as computationally expensive as a reasonable block cipher, if not worse. For example, SHA-1 requires doing the following for each 512-bit block:
Allocate 12 32-bit variables.
Allocate 80 32-bit words for the expanded message
64 times: Perform three array lookups, three 32-bit xors, and a rotate operation
80 times: Perform up to five 32-bit binary operations (some combination of xor, and, or, and not, depending on the round); then a rotate, an array lookup, four adds, another rotate, and several memory loads/stores.
Perform five 32-bit twos-complement adds
There is one chunk per 512-bits of the message, plus a possible extra chunk at the end. This is 1136 binary operations per chunk (not counting memory operations), or about 16 operations per byte.
For comparison, the RC4 encryption algorithm requires four operations (three additions, plus an xor on the message) per byte, plus two array reads and two array writes. It also requires only 258 bytes of working memory, vs a peak of 368 bytes for SHA-1.
Key management is fundamental
With any confidentiality system, you must have some sort of secret. If you have no secrets, then anyone else can implement the same decoding algorithm, and your data is exposed to the world.
So, you have two choices as to where to put the secrecy. One option is to make the encipherment/decipherment algorithms secret. However, if the code (or binaries) for the algorithm is ever leaked, you lose - it's quite hard to replace such an algorithm.
Thus, secrets are generally made easy to replace - this is what we call a key.
Your proposed usage of hash algorithms would require a salt - this is the only secret in the system and is therefore a key. Whether you like it or not, you will have to manage this key carefully. And it's a lot harder to replace than other keys if it leaks - you have to spend many CPU-hours generating a new rainbow table every time it's changed!
What should you do?
Use a real encryption algorithm, and spend some time actually thinking about key management. These issues have been solved before.
First, use a real encryption algorithm. AES has been designed for high performance and low RAM requirements. You could also use a stream cipher like RC4, as I mentioned before - the thing to watch out for with RC4, however, is that you must discard the first 4 kilobytes or so of output from the cipher, or you will be vulnerable to the same attacks that plagued WEP.
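For illustration only, here is what the "real encryption" route might look like in a Python sketch (a pre-shared 128-bit key is assumed, and on the real device C you would use a C implementation of AES; key distribution still has to be solved, as discussed below):

    # Sketch: send the 10-digit value encrypted with AES-GCM instead of hashing it.
    # The random nonce travels with the ciphertext; the GCM tag also gives integrity.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = os.urandom(16)  # pre-shared 128-bit key (distribution not shown)

    def seal(value: int) -> bytes:
        nonce = os.urandom(12)
        ciphertext = AESGCM(key).encrypt(nonce, str(value).encode("ascii"), None)
        return nonce + ciphertext  # 12-byte nonce || ciphertext+tag

    def open_sealed(blob: bytes) -> int:
        nonce, ciphertext = blob[:12], blob[12:]
        return int(AESGCM(key).decrypt(nonce, ciphertext, None))

    assert open_sealed(seal(9876543210)) == 9876543210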
Second, think about key management. One option is to simply burn a key into each client, and physically go out and replace it if the client is compromised. This is reasonable if you have easy physical access to all of the clients.
Otherwise, if you don't care about man-in-the-middle attacks, you can simply use Diffie-Hellman key exchange to negotiate a shared key between S and C. If you are concerned about MitMs, then you'll need to start looking at ECDSA or something to authenticate the key obtained from the D-H exchange - beware that when you start going down that road, it's easy to get things wrong, however. I would recommend implementing TLS at that point. It's not beyond the capabilities of an embedded system - indeed, there are a number of embedded commercial (and open source) libraries available already. If you don't implement TLS, then at least have a professional cryptographer look over your algorithm before implementing it.
There is obviously no such thing as a "perfect" hash unless you have at least as many hash buckets as inputs; if you don't, then inevitably it will be possible for two of your inputs to share the same hash bucket.
However, it's unlikely you'll be storing all the numbers between 0 and 10^11. So what's the pattern? If there's a pattern, there may be a perfect hash function for your actual data set.
It's really not that important to find a "perfect" hash function anyway, though. Hash tables are very fast. A function with a very low collision rate - and when hashing integers, that means nearly any simple function, like modulus - is fine and you'll get O(1) average performance.
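A trivial sketch of that last point (illustrative only; the bucket count is arbitrary):

    # Sketch: a simple modulus is usually a good enough hash for integer keys.
    NUM_BUCKETS = 1_000_003  # a prime bucket count spreads clustered keys a little better

    def bucket(number: int) -> int:
        return number % NUM_BUCKETS

    print(bucket(9876543210))  # bucket index in [0, NUM_BUCKETS)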