Leakage through Ciphertext-Plaintext Homomorphic Operations in SEAL

Consider two parties, namely, P_0 and P_1. P_0 and P_1 have plaintexts p_a and p_b respectively.
P_0 encrypts p_a to get c_a = Enc(p_a) with its public key, and sends it to P_1.
P_1 performs multiply_plain(c_a, p_b, c), followed by sub_plain_inplace(c, p_R) (where p_R is a random plaintext polynomial that hides the product of p_a and p_b), and then sends c to P_0.
Can the noise in c reveal some information about p_b to P_0, despite the product being masked by p_R?
If yes, then how can I avoid this leakage? Is there a way to add random noise to c to drown the impact of p_b on noise in c?
Is there a function in SEAL to encrypt using noise from a larger interval? If there is, then maybe I can encrypt p_R with extra noise to drown the impact.

Yes, the noise can in theory reveal information about the inputs to the product, even after adding a fresh encryption to it. Homomorphic encryption schemes are typically not designed to provide input privacy in such MPC protocols. It's not clear to me how feasible this "attack" would be to execute in realistic application scenarios though (except in pathological cases).
To avoid this issue, and to obtain semi-honest security for protocols you may want to build on top of the BFV scheme, you can indeed do what you suggested: flood the noise by adding an encryption with artificially large noise. This was used for example here (section 5.2) to prove the security of the protocol. See also Lemma 1 in this paper.
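To sketch why flooding works (a hedged restatement of the usual "smudging" lemma; the notation is mine, not from the linked papers): if the computation noise satisfies $|e| \le B$ and you add a fresh encryption of zero whose noise $e'$ is drawn uniformly from $[-B', B']$, then the statistical distance between $e + e'$ and $e'$ is bounded by

$$\Delta\big(e + e',\; e'\big) \;\le\; \frac{B}{B'},$$

so choosing $B' \ge 2^{\lambda} B$ makes the final noise distribution statistically independent of p_b, up to distance $2^{-\lambda}$.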
A fancier bootstrapping-based approach is described in this paper by Ducas and Stehlé. Since bootstrapping in both BGV and BFV is extremely restrictive (and not implemented in SEAL), I wouldn't consider this approach practical except perhaps in some very rare scenarios.


Does it matter which algorithm you use for Multiple Imputation by Chained Equations (MICE)?

I have seen MICE implemented with different types of algorithms, e.g., random forest or stochastic regression.
My question is: does it matter which type of algorithm is used, i.e., does one perform best? Is there any empirical evidence?
I am struggling to find any information on the web.
Thank you.
Yes, depending on your task, it can matter quite a lot which algorithm you choose.
You can also be sure the mice developers wouldn't put effort into providing different algorithms if there were one algorithm that always performed best. As in machine learning, the "no free lunch" theorem is also relevant for imputation.
In general, the default settings of mice are often a good choice.
Look at this example from the miceRanger vignette to see how much imputations can differ across algorithms (the real distribution is marked in red, the respective multiple imputations in black).
The predictive mean matching (pmm) algorithm, for example, ensures that only values that actually occur in the dataset are imputed. This is useful where the data contain only integer values like 0, 1, 2, 3 (and no values in between). Other algorithms won't do this; their regressions will also produce interpolated values (e.g., 1.1, 1.3, ...), as in the picture on the right. Both solutions can come with certain drawbacks.
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.
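mice itself is an R package, but the point that the conditional model matters can be demonstrated with scikit-learn's IterativeImputer, which implements a MICE-style chained-equations loop. A minimal sketch (the synthetic data and the two estimators are illustrative choices, not from the answer above):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))
X_true[:, 3] += 2 * X_true[:, 0]           # give the imputers structure to learn
X = X_true.copy()
mask = rng.random(X.shape) < 0.2           # ~20% missing completely at random
X[mask] = np.nan

# Same chained-equations loop, two different conditional models.
for est in (BayesianRidge(), RandomForestRegressor(n_estimators=100, random_state=0)):
    X_imp = IterativeImputer(estimator=est, max_iter=10, random_state=0).fit_transform(X)
    rmse = np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2))
    print(f"{type(est).__name__:>22}: RMSE on masked entries = {rmse:.3f}")
```

Which estimator wins depends on the data-generating process (linear structure favors the linear model; thresholds and interactions favor the forest), which is exactly the no-free-lunch point above.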

Modulus Switching in SEAL library

Conventionally, modulus switching is primarily used to make the noise growth linear, as opposed to exponential. However, in the BFV examples, it has been introduced as a tool to shave off primes (thereby reducing the bitlength of coefficient modulus) and improve computational efficiency.
Does it help in reducing noise growth in the BFV scheme as well? Will I observe exponential growth in noise without (manually) switching modulus?
In BFV you don't need to do modulus switching, because exponential noise growth is prevented by the scale-invariance property. Its main benefit is therefore improved computational performance and, perhaps, communication cost.
For example, in a simple protocol Alice might encrypt data and send it to Bob, who computes on it and sends the result back. If Alice only needs to decrypt the result, the parameters can just as well be as small as possible when Alice receives it, so Bob should switch to the smallest possible parameters before sending the data back to minimize the communication cost.
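To make the communication argument concrete (the numbers here are illustrative, not from the answer): a BFV ciphertext is a pair of polynomials with coefficients modulo $q$, so it occupies roughly

$$\text{size}(c) \;\approx\; 2\, n \log_2 q \ \text{bits}.$$

If Bob switches from, say, a 218-bit coefficient modulus down to a single 60-bit prime before replying, the ciphertext he sends back shrinks by a factor of about $218/60 \approx 3.6$, with no loss of correctness as long as enough noise budget remains for Alice to decrypt.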

Output range for continuous control policy network

I tried to implement the simple vanilla policy gradient (REINFORCE) in a continuous control problem by adapting this pytorch implementation to the continuous case and I stumbled upon the following issue.
Usually, when the action space is discrete, the output of the policy network is bounded in (0,1)^n by the softmax function, which gives the probability that the agent picks a certain action given the state (the input to the network). However, when the action space is continuous, for example if we have K actions such that each action a_k has lower and upper bounds l_k and u_k, I haven't found a way (empirical or theoretical) to limit the output of the network (usually the means and standard deviations of the action distribution given the state) using l_k and u_k.
From the few trials I made, without constraining the output of the policy network it was very hard, if not impossible, to learn a good policy, but I might be doing something wrong since I am new to reinforcement learning.
My intuition suggests limiting the mean and standard deviation outputs of the policy network using, for example, a sigmoid, and then scaling them by the absolute difference between l_k and u_k. I'm not quite sure how to do it properly though, considering also that the sampled action could exceed whatever bound you impose on the distribution parameters when using, for example, a Gaussian distribution.
Am I missing something? Are there established ways to limit the output of the policy network for continuous action spaces, or is there no need to do that at all?
I am not sure this is the right place for this question, if not I will be glad if you point to me a better place.
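For what it's worth, here is a minimal PyTorch sketch of the squash-and-scale idea from the question: the mean is passed through a sigmoid and rescaled to [l_k, u_k], the standard deviation is a learned parameter, and sampled actions are clamped before being sent to the environment. The architecture and constants are illustrative, not an established recipe:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Gaussian policy whose mean is sigmoid-squashed into [low, high] per dimension."""
    def __init__(self, obs_dim, act_dim, low, high, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent std
        self.register_buffer("low", torch.as_tensor(low, dtype=torch.float32))
        self.register_buffer("high", torch.as_tensor(high, dtype=torch.float32))

    def forward(self, obs):
        h = self.body(obs)
        # sigmoid squashes to (0, 1); rescale to (low, high) as suggested above
        mean = self.low + (self.high - self.low) * torch.sigmoid(self.mean_head(h))
        return torch.distributions.Normal(mean, self.log_std.exp())

policy = GaussianPolicy(obs_dim=3, act_dim=1, low=[-2.0], high=[2.0])
dist = policy(torch.randn(1, 3))
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)             # enters the REINFORCE loss
action_env = action.clamp(policy.low, policy.high)   # samples can still leave the bounds
```

Note that clamping the sample biases the log-probabilities near the bounds; tanh-squashing the sample itself with a log-probability correction (as done in SAC) is a common alternative.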

k-means clustering: variable selection

I'm applying a k-means algorithm to cluster my customer base. I'm struggling conceptually with how to select the dimensions (variables) to include in the model. I was wondering if there are established methods to compare models with different variables. In particular, I was thinking of using the common SS_within / SS_between ratio, but I'm not sure whether that can be applied to compare models with a different number of dimensions...
Any suggestions?
Thanks a lot.
Classic approaches are sequential selection algorithms like "sequential floating forward selection" (SFFS) or "sequential floating backward elimination" (SFBS). These are heuristic methods where you eliminate (or add) one feature at a time based on your performance metric, e.g., mean squared error (MSE). You could also use a genetic algorithm for this if you like.
Here is an easy-going paper that summarizes the ideas:
Feature Selection from Huge Feature Sets
And a more advanced one that could be useful: Unsupervised Feature Selection for the k-means Clustering Problem
EDIT:
When I think about it again: I initially had in mind the question "how do I select the k best features (where k < d)," e.g., for computational efficiency or visualization purposes. Now I think what you were asking is more like "what is the feature subset that performs best overall?" The silhouette index (similarity of points within a cluster) could be useful, but I don't think you can really improve performance via feature selection unless you have the ground-truth labels.
I have to admit that I have more experience with supervised rather than unsupervised methods. Thus, I typically prefer regularization over feature selection/dimensionality reduction when it comes to tackling the "curse of dimensionality." I use dimensionality reduction frequently for data compression though.
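As a concrete illustration of greedy forward selection scored by the silhouette index, here is a minimal scikit-learn sketch (plain forward selection rather than full SFFS, on synthetic blob data; note the silhouette is computed in each candidate subspace, so cross-dimensional comparisons are heuristic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=0)

def forward_select(X, k=4, n_keep=3):
    """Greedily add the feature that most improves the silhouette index."""
    remaining, chosen = list(range(X.shape[1])), []
    while len(chosen) < n_keep:
        scored = []
        for f in remaining:
            cols = chosen + [f]
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[:, cols])
            scored.append((silhouette_score(X[:, cols], labels), f))
        best_score, best_f = max(scored)
        chosen.append(best_f)
        remaining.remove(best_f)
        print(f"added feature {best_f}, silhouette = {best_score:.3f}")
    return chosen

print("selected features:", forward_select(X))
```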

Results of two feature selection algorithms do not match

I am working on two feature selection algorithms for a real-world problem where the sample size is 30 and the feature count is 80. The first algorithm is wrapper forward feature selection using an SVM classifier; the second is a filter feature selection algorithm using the Pearson product-moment correlation coefficient and Spearman's rank correlation coefficient. It turns out that the features selected by these two algorithms do not overlap at all. Is this reasonable? Does it mean I made mistakes in my implementation? Thank you.
FYI, I am using LIBSVM + MATLAB.
It can definitely happen, as the two strategies do not have the same expressive power.
Trust the wrapper if you want the best feature subset for prediction; trust the correlation if you want all features that are linked to the output/predicted variable. These subsets can be quite different, especially if you have many redundant features.
Using the top correlated features is a strategy that assumes the relationships between the features and the output/predicted variable are linear (or at least monotonic, in the case of Spearman's rank correlation), and that the features are statistically independent of one another and do not 'interact' with one another. These assumptions are most often violated in real-world problems.
Correlations, or other 'filters' such as mutual information, are better used to filter features out, i.e., to decide which features not to consider, rather than which features to consider. Filters are necessary when the initial feature count is very large (hundreds, thousands) to reduce the workload for a subsequent wrapper algorithm.
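To see the non-overlap phenomenon on synthetic data, here is a small scikit-learn sketch comparing a wrapper (greedy forward selection with a linear SVM) against a Pearson-correlation filter; the sample and feature counts mirror the question, everything else is illustrative:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Small-sample setting mirroring the question: 30 samples, 80 features.
X, y = make_classification(n_samples=30, n_features=80, n_informative=5,
                           n_redundant=10, random_state=0)

# Wrapper: greedy forward selection driven by SVM cross-validation accuracy.
sfs = SequentialFeatureSelector(SVC(kernel="linear"), n_features_to_select=5,
                                direction="forward", cv=3).fit(X, y)
wrapper_set = set(np.flatnonzero(sfs.get_support()))

# Filter: top five features by absolute Pearson correlation with the label.
corr = np.array([abs(pearsonr(X[:, j], y)[0]) for j in range(X.shape[1])])
filter_set = set(np.argsort(corr)[-5:])

print("wrapper:", sorted(wrapper_set))
print("filter: ", sorted(filter_set))
print("overlap:", sorted(wrapper_set & filter_set))   # often small or empty
```

With 30 samples and 80 features, the wrapper's cross-validation estimate is very noisy, so substantial disagreement between the two methods is expected even without implementation bugs.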
Depending on the distribution of the data, you can use either Spearman or Pearson. The latter is suited to normally distributed data, while the former works for non-normal data. Check the distribution and use the appropriate one.