How do you choose an optimal PlainModulus in SEAL? - seal

I am currently learning how to use SEAL and in the parameters for BFV scheme there was a helper function for choosing the PolyModulus and CoeffModulus and however this was not provided for choosing the PlainModulus other than it should be either a prime or a power of 2 is there any way to know which optimal value to use?
In the given example the PlainModulus was set to parms.PlainModulus = new SmallModulus(256); Is there any special reason for choosing the value 256?

In BFV, the plain_modulus basically determines the size of your data type, just like in normal programming when you use 32-bit or 64-bit integers. When using BatchEncoder the data type applies to each slot in the plaintext vectors.
How you choose plain_modulus matters a lot: the noise budget consumption in multiplications is proportional to log(plain_modulus), so there are good reasons to keep it as small as possible. On the other hand, you'll need to ensure that you don't get into overflow situations during your computations, where your encrypted numbers exceed plain_modulus, unless you specifically only care about correctness of the results modulo plain_modulus.
In almost all real use-cases of BFV you should want to use BatchEncoder to not waste plaintext/ciphertext polynomial space, and this requires plain_modulus to be a prime. Therefore, you'll probably want it to be a prime, except in some toy examples.

Related

What accounts for most of the integer multiply instructions?

The majority of integer multiplications don't actually need multiply:
Floating-point is, and has been since the 486, normally handled by dedicated hardware.
Multiplication by a constant, such as for scaling an array index by the size of the element, can be reduced to a left shift in the common case where it's a power of two, or a sequence of left shifts and additions in the general case.
Multiplications associated with accessing a 2D array, can often be strength reduced to addition if it's in the context of a loop.
So what's left?
Certain library functions like fwrite that take a number of elements and an element size as runtime parameters.
Exact decimal arithmetic e.g. Java's BigDecimal type.
Such forms of cryptography as require multiplication and are not handled by their own dedicated hardware.
Big integers e.g. for exploring number theory.
Other cases I'm not thinking of right now.
None of these jump out at me as wildly common, yet all modern CPU architectures include integer multiply instructions. (RISC-V omits them from the minimal version of the instruction set, but has been criticized for even going this far.)
Has anyone ever analyzed a representative sample of code, such as the SPEC benchmarks, to find out exactly what use case accounts for most of the actual uses of integer multiply (as measured by dynamic rather than static frequency)?

improve hashing using genetic programming/algorithm

I'm writing a program which can significantly lessen the number of collisions that occur while using hash functions like 'key mod table_size'. For this I would like to use Genetic Programming/Algorithm. But I don't know much about it. Even after reading many articles and examples I don't know that in my case (as in program definition) what would be the fitness function, target (target is usually the required result), what would pose as the population/individuals and parents, etc.
Please help me in identifying the above and with a few codes/pseudo-codes snippets if possible as this is my project.
Its not necessary to be using genetic programming/algorithm, it can be anything using evolutionary programming/algorithm.
thanks..
My advice would be: don't do this that way. The literature on hash functions is vast and we more or less understand what makes a good hash function. We know enough mathematics not to look for them blindly.
If you need a hash function to use, there is plenty to choose from.
However, if this is your uni project and you cannot possibly change the subject or steer it in a more manageable direction, then as you noticed there will be complex issues of getting fitness function and mutation operators right. As far as I can tell off the top of my head, there are no obvious candidates.
You may look up e.g. 'strict avalanche criterion' and try to see if you can reason about it in terms of fitness and mutations.
Another question is how do you want to represent your function? Just a boolean expression? Something built from word operations like AND, XOR, NOT, ROT ?
Depending on your constraints (or rather, assumptions) the question of fitness and mutation will be different.
Broadly fitness is clearly minimize the number of collisions in your 'hash modulo table-size' model.
The obvious part is to take a suitably large and (very important) representative distribution of keys and chuck them through your 'candidate' function.
Then you might pass them through 'hash modulo table-size' for one or more values of table-size and evaluate some measure of 'niceness' of the arising distribution(s).
So what that boils down to is what table-sizes to try and what niceness measure to apply.
Niceness is context dependent.
You might measure 'fullest bucket' as a measure of 'worst case' insert/find time.
You might measure sum of squares of bucket sizes as a measure of 'average' insert/find time based on uniform distribution of amongst the keys look-up.
Finally you would need to decide what table-size (or sizes) to test at.
Conventional wisdom often uses primes because hash modulo prime tends to be nicely volatile to all the bits in hash where as something like hash modulo 2^n only involves the lower n-1 bits.
To keep computation down you might consider the series of next prime larger than each power of two. 5(>2^2) 11 (>2^3), 17 (>2^4) , etc. up to and including the first power of 2 greater than your 'sample' size.
There are other ways of considering fitness but without a practical application the question is (of course) ill-defined.
If your 'space' of potential hash functions don't all have the same execution time you should also factor in 'cost'.
It's fairly easy to define very good hash functions but execution time can be a significant factor.

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first (before we starting thinking that their implementation required more computational complexity than other solutions)).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range of size a small multiple of N. Gnu has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5 byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'ed end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

Algorithm generation

I have a rather large(not too large but possibly 50+) set of conditions that must be placed on a set of data(or rather the data should be manipulated to fit the conditions).
For example, Suppose I have the a sequence of binary numbers of length n,
if n = 5 then a element in the data might be {0,1,1,0,0} or {0,0,0,1,1}, etc...
BUT there might be a set of conditions such as
x_3 + x_4 = 2
sum(x_even) <= 2
x_2*x_3 = x_4 mod 2
etc...
Because the conditions are quite complex in that they come from experiment(although they can be written down in logic form) and are hard to diagnose I would like instead to use a large sample set of valid data. i.e., Data I know satisfies the conditions and is a pretty large set. i.e., it is easier to collect the data then it is to deduce the conditions that the data must abide by.
Having said that, basically what I'm doing is very similar to neural networks. The difference is, I would like an actual algorithm, in some sense optimal, in some form of code that I can run instead of the network.
It might not be clear what I'm actually trying to do. What I have is a set of data in some raw format that is unique and unambiguous but not appropriate for my needs(in a sense the amount of data is too large).
I need to map the data into another set that actually is ambiguous to some degree but also has certain specific set of constraints that all the data follows(certain things just cannot happen while others are preferred).
The unique constraints and preferences are hard to figure out. That is, the mapping from the non-ambiguous set to the ambiguous set is hard to describe(which is why it is ambiguous). The goal, actually, is to have an unambiguous map by supplying the right constraints if at all possible.
So, on the vein of my initial example, I'm given(or supply) a set of elements and need some way to derive a list of constraints similar to what I've listed.
In a sense, I simply have a set of valid data and train it very similar to neural networks.
Then, after this "Training" I'm given the mapping function I can then use on any element in my dataset and it will produce a new element satisfying the constraint's, or if it can't, will give as close as possible an unambiguous result.
The main difference between neural networks and what I'm trying to achieve is I'd like to be able to use have an algorithm to code to be used instead of having to run a neural network. The difference here is the algorithm would probably be a lot less complex, not need potential retraining, and a lot faster.
Here is a simple example.
Suppose my "training set" are the binary sequences and mappings
01000 => 10000
00001 => 00010
01010 => 10100
00111 => 01110
then from the "Magical Algorithm Finder"(tm) I would get a mapping out like
f(x) = x rol 1 (rol = rotate left)
or whatever way one would want to express it.
Then I could simply apply f(x) to any other element, such as x = 011100 and could apply f to generate a hopefully unambiguous output.
Of course there are many such functions that will work on this example but the goal is to supply enough of the dataset to narrow it down to hopefully a few functions that make the most sense(at the very least will always map the training set correctly).
In my specific case I could easily convert my problem into mapping the set of binary digits of length m to the set of base B digits of length n. The constraints prevents some numbers from having an inverse. e.g., the mapping is injective but not surjective.
My algorithm could be a simple collection if statements acting on the digits if need be.
I think what you are looking for here is an application of Learning Classifier Systems, LCS -wiki. There are actually quite a few LCS open-source applications available, but you may need to experiment with the parameters in order to get a good result.
LCS/XCS/ZCS have the features that you are looking for including individual rules that could be heavily optimized, pressure to reduce the rule-set, and of course a human-readable/understandable set of rules. (Unlike a neural-net)

Hash operator in Matlab for linear indices of vectors

I am clustering a large set of points. Throughout the iterations, I want to avoid re-computing cluster properties if the assigned points are the same as the previous iteration. Each cluster keeps the IDs of its points. I don't want to compare them element wise, comparing the sum of the ID vector is risky (a small ID can be compensated with a large one), may be I should compare the sum of squares? Is there a hashing method in Matlab which I can use with confidence?
Example data:
a=[2,13,14,18,19,21,23,24,25,27]
b=[6,79,82,85,89,111,113,123,127,129]
c=[3,9,59,91,99,101,110,119,120,682]
d=[11,57,74,83,86,90,92,102,103,104]
So the problem is that if I just check the sum, it could be that cluster d for example, looses points 11,103 and gets 9,105. Then I would mistakenly think that there has been no change in the cluster.
This is one of those (very common) situations where the more we know about your data and application the better we are able to help. In the absence of better information than you provide, and in the spirit of exposing the weakness of answers such as this in that absence, here are a couple of suggestions you might reject.
One appropriate data structure for set operations is a bit-set, that is a set of length equal to the cardinality of the underlying universe of things in which each bit is set on or off according to the things membership of the (sub-set). You could implement this in Matlab in at least two ways:
a) (easy, but possibly consuming too much space): define a matrix with as many columns as there are points in your data, and one row for each cluster. Set the (cluster, point) value to true if point is a member of cluster. Set operations are then defined by vector operations. I don't have a clue about the relative (time) efficiency of setdiff versus rowA==rowB.
b) (more difficult): actually represent the clusters by bit sets. You'll have to use Matlab's bit-twiddling capabilities of course, but the pain might be worth the gain. Suppose that your universe comprises 1024 points, then you'll need an array of 16 uint64 values to represent the bit set for each cluster. The presence of, say, point 563 in a cluster requires that you set, for the bit set representing that cluster, bit 563 (which is probably bit 51 in the 9th element of the set) to 1.
And perhaps I should have started by writing that I don't think that this is a hashing sort of a problem, it's a set sort of a problem. Yeah, you could use a hash but then you'll have to program around the limitations of using a screwdriver on a nail (choose your preferred analogy).
If I understand correctly, to hash the ID's I would recommend using the matlab Java interface to use the Java hashing algorithms
http://docs.oracle.com/javase/1.4.2/docs/api/java/security/MessageDigest.html
You'll do something like:
hash = java.security.MessageDigest.getInstance('SHA');
Hope this helps.
I found the function
DataHash on FEX it is quiet fast for vectors and the strcmp on the keys is a lot faster than I expected.