Generating an activation key from a large set of serial numbers and activation keys

I have a bunch of serial numbers and their corresponding activation keys for some old software. Since installing them originally, I have lost a number of the activation keys (but still have the serial numbers). I still have a data set of about 20 pairs, and even eyeballing it I can tell there is a method to the madness in determining the activation keys. Given my large data set, is there a way I can back-solve to figure out the activation keys I lost?
example of serial #: 14051 Activation Key: E9E9F-9993432-45543

What you're trying to do is come up with a function that maps serial numbers to activation keys. Without knowing more about the nature of the function, this could be anywhere from very easy (a polynomial with only a few terms) to very hard (a multi-tiered function involving lots of block XORs, substitution tables, complicated key schedules, ...).
If you have access to the key verifier routine (e.g. by disassembly - which is almost always against the EULAs of commercial software), then you have a routine that returns whether or not a given activation key is correct for a given serial number. If this was done by computing an activation key for a serial number, then you are practically done. If this was done by computing the inverse function on the key, then your task is a little harder: you need to invert that function to retrieve the key derivation algorithm, which may not be so easy. If you end up having to solve some hard mathematical problems (e.g. the discrete logarithm problem) because the scheme depends on public-key cryptography, then you're hoping that the values you're dealing with are small enough that you can brute-force or use a known algorithm (e.g. Pollard's rho algorithm) in computationally feasible time.
In any case, you'll need to get comfortable with disassembly and debugging, and hope that there are no anti-debugger measures in place.
Otherwise, the problem is much harder - you'd need to make some educated guesses and try them (e.g. by trying to do a polynomial fit), and hope for the best. Because of the very large variety of different possible functions that can fit any set of inputs and outputs (mathematically uncountable, though in practice limited by source code size), trying to do a known-plaintext attack on the algorithm itself is generally infeasible.
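If you want to try the educated-guess route, a quick sanity check is a least-squares polynomial fit over the numeric parts of your known pairs. A minimal sketch, where the serial numbers and extracted key fields below are invented for illustration (you'd substitute one numeric segment parsed out of your real keys):

```python
import numpy as np

# Hypothetical data: serial numbers and one numeric segment extracted
# from the corresponding activation keys (e.g. the middle group).
serials = np.array([14051, 14052, 14060, 14101], dtype=float)
key_part = np.array([9993432, 9993435, 9993459, 9993582], dtype=float)

# Try low-degree polynomial fits and check which one reproduces the data exactly.
for degree in range(1, 4):
    coeffs = np.polyfit(serials, key_part, degree)
    predicted = np.rint(np.polyval(coeffs, serials))
    if np.array_equal(predicted, key_part):
        print(f"degree-{degree} polynomial fits exactly: {coeffs}")
        break
else:
    print("no low-degree polynomial fit; the mapping is likely non-polynomial")
```

If no low-degree fit works, that is weak evidence the scheme involves hashing or bit-level operations rather than simple arithmetic, and you're back to disassembly.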

It depends on how dumb the scheme was in the first place, but my guess would be that it's not likely. There's no fixed methodology, but the general domain is the same as codebreaking.


How do you choose an optimal PlainModulus in SEAL?

I am currently learning how to use SEAL. In the parameters for the BFV scheme there is a helper function for choosing the PolyModulus and CoeffModulus, but nothing equivalent is provided for choosing the PlainModulus, other than guidance that it should be either a prime or a power of 2. Is there any way to know which value is optimal?
In the given example the PlainModulus was set to parms.PlainModulus = new SmallModulus(256); is there any special reason for choosing the value 256?
In BFV, the plain_modulus basically determines the size of your data type, just like in normal programming when you use 32-bit or 64-bit integers. When using BatchEncoder the data type applies to each slot in the plaintext vectors.
How you choose plain_modulus matters a lot: the noise budget consumption in multiplications is proportional to log(plain_modulus), so there are good reasons to keep it as small as possible. On the other hand, you'll need to ensure that you don't get into overflow situations during your computations, where your encrypted numbers exceed plain_modulus, unless you specifically only care about correctness of the results modulo plain_modulus.
In almost all real use cases of BFV you will want to use BatchEncoder so as not to waste plaintext/ciphertext polynomial space, and batching requires plain_modulus to be a prime congruent to 1 modulo 2 * poly_modulus_degree. Therefore, you'll probably want it to be such a prime, except in some toy examples.
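To make that concrete: batching needs plain_modulus to be a prime p with p ≡ 1 (mod 2N), where N is poly_modulus_degree. A minimal pure-Python sketch of finding the smallest such prime above a given bit size (helper names are mine; no SEAL dependency):

```python
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def smallest_batching_prime(poly_modulus_degree, min_bits):
    # Batching needs plain_modulus prime and congruent to 1 mod 2*N.
    step = 2 * poly_modulus_degree
    # First value >= 2^min_bits that is congruent to 1 mod 2*N.
    candidate = (1 << min_bits) // step * step + 1
    while not is_prime(candidate):
        candidate += step
    return candidate

# Smallest batching-compatible prime above 2^17 for N = 4096.
print(smallest_batching_prime(4096, 17))
```

SEAL itself ships a similar helper (PlainModulus::Batching in recent versions), so in practice you'd use that rather than rolling your own.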

Is there a way to verify a common seed to a cumulative sequence of hashes with unknown repetitions between each value presented?

I am writing a variant of Cuckoo Cycle that uses an adjacency list for presenting solutions from two pairs of 8-bit coordinates. I am not having any problems finding what I think should be an optimal solver for it: it uses two pairs of head/tail binary search trees to keep track of possible solution nodes and rejected (branch) nodes, plus a binary tree that keeps a list of the candidate cycles as they are being assembled (as I understand it, binary search trees shorten the amount of processing needed to find duplicates). But I need to refine the verifier function for solutions.
I see in Cuckoo that there is some process by which it modifies the edges with XOR functions and masks to identify a valid cycle, but I have two issues.
One is that each hash is generated from the previous hash, starting with the nonce. Proving that all offered node/edge pairs are valid derivatives of the nonce seems to require the verifier to repeat the hash function, checking for a match each time until it gets a hit, which could take up to several thousand iterations in the worst case. Is there some property that can be used to shortcut this identification process, given that (unlike in DoS protection) we are providing the salt of the hash?
Second, even if the presented cycle is perfectly valid, it is possible that one or more of the node/edge pairs in the cycle has a duplicate coordinate. The hashes are 32 bits long and each coordinate is 8 bits. The answer to this probably relates to the previous question, as having the seed of a hash function is a known security risk because of collisions. So, as well as verifying that the nodes form a cycle among the lowest possible values in the finite field, I need a way to be sure that a pair does not overlap with another possible (branching) pair.
I will be studying the verifier closer in the Cuckoo Cycle implementation to see if I can figure out how the algorithm ensures it is not approving a cycle that actually has a branch (and thus is invalid), but I thought I'd pop the question on this site in case someone knows better the ways of recognising hashes from a common seed, and if there is any way to recognise a 50% collision between a given coordinate and another one.
Note: after thinking about it for a while, I realised that I could solve the 'fake cycle' problem (one or more nodes having a branch) by simply splitting the heads and tails into separate, consecutive hashes (odd then even), such as 16-bit Murmur3 hashes.
Thinking about it further, I realised that Cuckoo Cycle is actually a special type of hash collision search, one that seeks only collisions occurring exactly once in the low order of the finite field. I am devising a new scheme called Hummingbird, which will not target the smallest numbers (which is also what Hashcash does) but instead the hashes most proximate in the chain to the seed nonce. This means that attempts to insert branched nodes into the solution graph will be discovered during verification, which will probably take about 2-5 seconds depending on depth. Such solutions could also be eliminated by specifying a maximum hash-chain length as part of the consensus.
I just wanted to add that I answered my own question by realising that I am looking for, essentially, a hash collision. The simplest solution, with the least bit-twiddling, was to make each coordinate a distinct hash in a hash chain (hash of nonce, then hash of hash, etc.).
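A minimal sketch of that hash-chain idea (SHA-256 stands in for whatever hash the scheme actually uses; all names are mine): each coordinate is one truncated link, and the verifier simply walks the chain from the nonce, bounded by a maximum chain length:

```python
import hashlib

def hash_chain(nonce: bytes, length: int):
    """Derive a chain of 32-bit values: h1 = H(nonce), h2 = H(h1), ..."""
    links = []
    h = nonce
    for _ in range(length):
        h = hashlib.sha256(h).digest()
        links.append(int.from_bytes(h[:4], "big"))  # truncate to 32 bits
    return links

def verify_membership(nonce: bytes, claimed: list, max_length: int) -> bool:
    """Check that every claimed 32-bit value appears somewhere in the chain,
    walking at most max_length links from the nonce."""
    remaining = set(claimed)
    h = nonce
    for _ in range(max_length):
        h = hashlib.sha256(h).digest()
        remaining.discard(int.from_bytes(h[:4], "big"))
        if not remaining:
            return True
    return False

chain = hash_chain(b"nonce-123", 1000)
print(verify_membership(b"nonce-123", [chain[5], chain[500]], 1000))  # True
```

This matches the worst case described in the question: the verifier may have to hash up to max_length times, which is why capping the chain length in the consensus rules keeps verification time bounded.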
I didn't understand fully that Cuckoo Cycle is essentially a search for partial hash collisions, and when that dawned on me, I realised that the simple solution is to just make it into a search for hash collisions.
I have, from this realisation, moved very quickly forward to figuring out how my variation of Cuckoo can be much more simply implemented, as well as how to structure the B-tree based progressive search algorithm, the difficulty adjustment, and the rest.
I wasn't aware there was a stackexchange specialist site for math, or cryptography, or I would have posted it there instead. I studied FEC a few months ago and that opened the floodgates to a whole bunch of other ideas that led me to getting so worked up about Cuckoo Cycle. I believe I have generalised the Cuckoo Cycle into a generic, parameterisable graph theoretic proof of work and I will get back to finishing my implementation.
Thanks to everyone who submitted an answer, I will upvote as I deem correct, though I have zero or nearly zero rep, for what it's worth.

improve hashing using genetic programming/algorithm

I'm writing a program which can significantly lessen the number of collisions that occur while using hash functions like 'key mod table_size'. For this I would like to use genetic programming or a genetic algorithm, but I don't know much about it. Even after reading many articles and examples, I don't know what, in my case, would be the fitness function or the target (the target is usually the required result), what would serve as the population/individuals and parents, etc.
Please help me in identifying the above and with a few codes/pseudo-codes snippets if possible as this is my project.
It's not necessary to use a genetic programming/algorithm approach; it can be anything using evolutionary programming/algorithms.
Thanks.
My advice would be: don't do this that way. The literature on hash functions is vast and we more or less understand what makes a good hash function. We know enough mathematics not to look for them blindly.
If you need a hash function to use, there is plenty to choose from.
However, if this is your uni project and you cannot possibly change the subject or steer it in a more manageable direction, then as you noticed there will be complex issues of getting fitness function and mutation operators right. As far as I can tell off the top of my head, there are no obvious candidates.
You may look up e.g. 'strict avalanche criterion' and try to see if you can reason about it in terms of fitness and mutations.
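As a rough illustration of how the strict avalanche criterion could feed a fitness function, here is a sketch that scores a hash by how close it comes to flipping half the output bits per flipped input bit (SHA-256 serves as a known-good reference; the helper and parameters are mine):

```python
import hashlib
import random

def avalanche_score(hash_fn, nbytes, trials=200):
    """Average fraction of output bits flipped when one random input bit
    flips. The strict avalanche criterion ideal is 0.5."""
    total = 0.0
    rng = random.Random(1)
    for _ in range(trials):
        msg = bytearray(rng.randbytes(nbytes))
        h1 = hash_fn(bytes(msg))
        bit = rng.randrange(nbytes * 8)
        msg[bit // 8] ^= 1 << (bit % 8)       # flip one input bit
        h2 = hash_fn(bytes(msg))
        diff = int.from_bytes(h1, "big") ^ int.from_bytes(h2, "big")
        total += bin(diff).count("1") / (len(h1) * 8)
    return total / trials

sha = lambda m: hashlib.sha256(m).digest()
print(round(avalanche_score(sha, 16), 2))
```

A fitness function for an evolutionary search could then be the distance of this score from 0.5, with mutation operating on whatever representation of candidate functions you choose.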
Another question is how do you want to represent your function? Just a boolean expression? Something built from word operations like AND, XOR, NOT, ROT ?
Depending on your constraints (or rather, assumptions) the question of fitness and mutation will be different.
Broadly, the fitness function is clearly to minimize the number of collisions in your 'hash modulo table-size' model.
The obvious part is to take a suitably large and (very important) representative distribution of keys and chuck them through your 'candidate' function.
Then you might pass them through 'hash modulo table-size' for one or more values of table-size and evaluate some measure of 'niceness' of the arising distribution(s).
So what that boils down to is what table-sizes to try and what niceness measure to apply.
Niceness is context dependent.
You might measure 'fullest bucket' as a measure of 'worst case' insert/find time.
You might measure the sum of squares of bucket sizes as a measure of 'average' insert/find time, assuming look-ups are uniformly distributed amongst the keys.
Finally you would need to decide what table-size (or sizes) to test at.
Conventional wisdom often uses primes, because hash modulo a prime tends to be nicely sensitive to all the bits in the hash, whereas something like hash modulo 2^n only involves the lower n bits.
To keep computation down you might consider the series of the next prime larger than each power of two: 5 (> 2^2), 11 (> 2^3), 17 (> 2^4), etc., up to and including the first power of 2 greater than your 'sample' size.
There are other ways of considering fitness but without a practical application the question is (of course) ill-defined.
If your 'space' of potential hash functions doesn't give them all the same execution time, you should also factor in 'cost'.
It's fairly easy to define very good hash functions but execution time can be a significant factor.
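Putting the pieces above together, a minimal sketch of the proposed evaluation: sum of squared bucket sizes, measured over the prime-just-above-each-power-of-two series of table sizes (the sample keys and candidate functions are invented for illustration):

```python
import hashlib

def next_prime(n):
    def is_prime(k):
        if k < 2:
            return False
        return all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def fitness(hash_fn, keys, table_sizes):
    """Lower is better: sum over table sizes of the sum of squared
    bucket counts (a proxy for expected probe count)."""
    total = 0
    for size in table_sizes:
        buckets = [0] * size
        for key in keys:
            buckets[hash_fn(key) % size] += 1
        total += sum(b * b for b in buckets)
    return total

# A representative sample of keys and the prime series 5, 11, 17, 37, ...
keys = [f"user-{i}@example.com" for i in range(1000)]
sizes = [next_prime(2 ** k + 1) for k in range(2, 11)]

# Compare a deliberately bad candidate against a strong baseline.
bad = lambda s: len(s)                                  # collides constantly
good = lambda s: int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")
print(fitness(bad, keys, sizes) > fitness(good, keys, sizes))  # True
```

An evolutionary search would rank each candidate by this fitness value, with lower scores surviving to the next generation.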

Can a cryptographic hash algorithm be used as a PRNG?

Can MD5/SHA256/SHA512, etc., be used as a PRNG? E.g., given an integer seed, is the pseudo-code:
random_number = truncate_to_desired_range(
    sha512( seed.toString() + ',' + i.toString() ) )
…a decent PRNG? (Here i is an increasing integer; e.g., the outputs are:
convert(sha512("<seed>,0"))
convert(sha512("<seed>,1"))
convert(sha512("<seed>,2"))
convert(sha512("<seed>,3"))
…)
"Decent", in the context of this question, refers only to the distribution of the output: is the output of cryptographic hash functions uniform, when used this way? (Though I suppose it would depend on the hash function, all cryptographic hashes should also have uniform output, right?)
Note: I will concede that this is going to be a slow PRNG, compared to say Mersenne-Twister, due to the use of a cryptographic hash. I'm not interested in speed, and I'm not interested in the result being secure — just that the distribution is correct.
In my particular use case, I'm looking for something similar to XKCD's geohashing, in that it is easily implemented by distributed parties, who will all arrive at the same answer. Mersenne Twister could be substituted, but it is less available in many target languages. (Some languages lack it entirely, some lack access to its raw U32 output, etc. SHA-512 is either built in or easily available.)
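The scheme in the question takes only a few lines in Python; a minimal sketch (modular reduction stands in for truncate_to_desired_range; the bias it introduces is negligible when the range is tiny relative to 2^512):

```python
import hashlib

def hash_prng(seed: int, i: int, n: int) -> int:
    """Return the i-th pseudorandom number in [0, n), derived from seed."""
    digest = hashlib.sha512(f"{seed},{i}".encode()).digest()
    return int.from_bytes(digest, "big") % n

# Distributed parties with the same seed reproduce the same sequence.
sequence = [hash_prng(42, i, 100) for i in range(5)]
print(sequence)
```

Because every output depends only on (seed, i), parties can also evaluate the sequence out of order or in parallel, which most conventional PRNGs do not allow.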
Assuming the cryptographic hash function meets its design goals, the output will be computationally indistinguishable from uniform, since every input to the hash function is unique by design.
One of the goals of a hash function is to approximate a random oracle, that is, for any two distinct inputs A and B, the outputs H(A) and H(B) should (for a true random oracle) be uncorrelated. Hash functions get pretty close to that, but of course weaknesses creep in with time and cryptanalysis.
That said, cryptographic primitives are essentially the best mathematical algorithms we have available when it comes to quality, therefore it is safe to say that if they cannot solve your problem, nothing will.
It can be made to work (with good-sized inputs, padding, etc., as mentioned in other answers/comments) and will provide reasonably good results, but it's going to be slow, so don't do that if you are running simulations or anything else that requires heavy PRNG use.

Hashing Similarity

Normally, the goal of hashing is to behave discontinuously: a small change in the input should cause a large change in the output. However, is there any hashing algorithm that will, (very) roughly speaking, return similar (but still different) hashes for similar inputs?
(An example of the use of this would be to check whether two files are "similar" by checking their hashes for similarity. Of course, some failure is always acceptable.)
Look at Locality Sensitive Hashing (LSH). That is a probabilistic way of quickly finding a bunch of points near a given one, for example.
Given a distance function that tells you how similar or different your objects are, you can also employ distance permutations:
http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2007.70815
or sketches:
http://portal.acm.org/citation.cfm?id=1638180
For an implementation of the latter approach:
http://obsearch.net
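As a toy illustration of the LSH idea, here is a random-hyperplane (SimHash-style) sketch, with all names and parameters invented: similar vectors receive hashes at small Hamming distance, dissimilar ones at large distance.

```python
import random

def simhash(vector, hyperplanes):
    """One bit per random hyperplane: which side of it the vector falls on."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (dot >= 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

random.seed(0)
dim, nbits = 8, 32
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(nbits)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
near = [v + 0.01 for v in x]   # small perturbation of x
far = [-v for v in x]          # points the opposite way

print(hamming(simhash(x, planes), simhash(near, planes)))  # small distance
print(hamming(simhash(x, planes), simhash(far, planes)))   # large distance
```

Comparing hashes by Hamming distance (rather than equality) is exactly the "similar but still different hashes" behaviour the question asks for.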
You really don't want to see similar hashes. Hashing is meant to ensure integrity, so the slightest change in your file/app/program should produce an entirely different hash. If two different strings produce the same hash, that is called a collision, and a practical method for finding collisions means the hashing algorithm is compromised. MD5 has known collisions but is still used today.