PySpark Proportionate Stratified Sampling "sampleBy"


Question: If you implement proportionate stratified sampling using PySpark's sampleBy, isn't it just the same thing as a random sample?
Edit: there is proportionate and disproportionate stratified sampling. This question is about the former.
Here's my thinking on this:
Let's say you have 4 groups in a population of total size N = 1000. The groups have proportions:
A: 25%, B: 50%, C: 13%, and D: 12%
Then choosing a proportionate stratified sample of size 100 means choosing a sample consisting of exactly 25 elements from A, 50 elements from B, 13 elements from C and 12 elements from D. (Note: A disproportionate stratified sample would be if you had different sampling ratios than those of the population.)
This is in contrast to doing a random sample where the expected number of elements from A, B, C and D are 25, 50, 13, and 12 respectively.
It would be natural to implement proportionate stratified sampling in PySpark via the sampleBy method with fractions
fractions = {'A': .1, 'B': .1, 'C': .1, 'D': .1}
If this method sampled exactly, you'd have 25, 50, 13 and 12 elements respectively. However, this method is implemented with a Bernoulli trial (coin flipping) per element. Since all the fractions are identical, each element is chosen with probability 10%, regardless of its stratum.
In this case doing the Bernoulli trial by strata and then by element is the same as doing this over the entire data set. The latter is just random sampling.
Conclusion: Stratified sampling is "not possible" in this paradigm.
Is this a correct understanding?
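The reasoning above can be checked with a small simulation. This is a plain-Python sketch, not PySpark: the helper bernoulli_sample_by is a hypothetical stand-in that only mimics the per-element coin flip described above, and the population is built from the N = 1000 example.

```python
import random
from collections import Counter

# Population from the question: N = 1000 with strata A, B, C, D.
population = ["A"] * 250 + ["B"] * 500 + ["C"] * 130 + ["D"] * 120
fractions = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.1}


def bernoulli_sample_by(items, fractions, rng):
    """Keep each item independently with the probability assigned to its stratum."""
    return [x for x in items if rng.random() < fractions[x]]


rng = random.Random(0)
for trial in range(3):
    counts = Counter(bernoulli_sample_by(population, fractions, rng))
    # Counts fluctuate around 25/50/13/12 instead of hitting them exactly.
    print(trial, dict(sorted(counts.items())), "total:", sum(counts.values()))
```

Because every stratum uses the same probability, the per-stratum lookup has no effect on the outcome: the same draws would result from flipping a 10% coin for every element without looking at its label.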
I've seen some posts on doing exact sampling using special tricks. I'll see if I can answer my own post using these methods (3) below.
Note: There is a sampleByKeyExact method, but it is not supported in Python, and even if it were, its performance and scaling penalties would not be ideal.
https://spark.apache.org/docs/2.2.0/mllib-statistics.html
Related Posts:
Stratified sampling in Spark (Mentions sampleByKeyExact which isn't supported in Python)
Investopedia Reference
https://www.investopedia.com/terms/stratified_random_sampling.asp
A creative work-around using additional columns that may work.
pyspark - how to select exact number of records per strata using (df.sampleByKey()) in stratified random sampling

I think there is some confusion here related to standard definitions. Usually when someone says "stratified sampling", they mean that different classes should get different probabilities. In the example posted above
A: 25%, B: 50%, C: 13%, and D: 12%
A standard stratified sample will give fractions that make sure that in expectation, the sample will have the same number of A,B,C,D. For example
fractions = {'A': .2, 'B': .1, 'C': 0.1*50/13, 'D': 0.1*50/12}
should give, in expectation, the same number of elements of each class (50 each for the N = 1000 population in the question).
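As a quick arithmetic check, the expected count per class is just the stratum size times its fraction. The sketch below assumes the N = 1000 population from the question, for which each class comes out at about 50 in expectation (it would be about 5 each for a population of 100).

```python
# Stratum sizes implied by the question: N = 1000 with 25%, 50%, 13%, 12%.
sizes = {"A": 250, "B": 500, "C": 130, "D": 120}
fractions = {"A": 0.2, "B": 0.1, "C": 0.1 * 50 / 13, "D": 0.1 * 50 / 12}

# Expected number of sampled rows per stratum under independent Bernoulli draws.
expected = {k: sizes[k] * fractions[k] for k in sizes}
print(expected)
```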
In the example given above where
fractions = {'A': .1, 'B': .1, 'C': 0.1, 'D': 0.1}
The behavior is indeed the same as a simple sample with a proportion of 0.1.
The real question is, what are you aiming for? If you want your sample to have exactly the same proportion of classes as the original, then neither sample nor sampleByKey will provide that. Looking at the documentation, it seems that sampleByKeyExact will indeed do the trick.
edit detailing the behavior of sample and sampleByKey:
For sample, a map operation basically goes over every element and based on a random variable decides whether to keep the item (and how many copies in case withReplacement == True). This random variable is i.i.d across all elements. In sampleByKey, the random variable is independent but has a different distribution based on the key value, or more accurately based on the corresponding value in the fractions argument. If the values in fractions are identical, this random variable will have the same distribution for all key values - that is why the behavior becomes identical for sample and sampleByKey.
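For contrast with the Bernoulli behaviour described above, here is a minimal in-memory sketch of exact proportionate sampling. It is plain Python, not a Spark API: exact_stratified_sample is a hypothetical helper that groups rows by stratum and draws a fixed number from each, which only works when the data fits on one machine.

```python
import random
from collections import Counter, defaultdict


def exact_stratified_sample(rows, key, fraction, rng):
    """Draw round(fraction * stratum_size) rows from every stratum, without replacement."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = round(fraction * len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Rows are (stratum, id) pairs matching the question's N = 1000 example.
rows = ([("A", i) for i in range(250)] + [("B", i) for i in range(500)]
        + [("C", i) for i in range(130)] + [("D", i) for i in range(120)])
sample = exact_stratified_sample(rows, key=lambda r: r[0], fraction=0.1, rng=random.Random(0))
print(Counter(label for label, _ in sample))
```

Unlike the coin-flip approach, the per-stratum counts here are fixed in advance; only which rows are chosen within each stratum is random.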

Related

Finding permutations under a set of linear constraints

I have two variables A and B, which both are 1x250 doubles.
Variables A and B are related to one another.
In other words, A(1) and B(1) are a pair (cause and effect).
A = [100, 1000, 254, 21,.....]
B = [-30, -100, -1254, -821,.....]
The problem is to find combinations in A that can meet two conditions below
sums of any combinations in A should be <= 500
At the same time, the sum of the corresponding values in B should be less than -5000
I tried to use nchoosek, but it really exploded after a few iterations due to the size of my variables.

What you describe is pretty much impossible in terms of time complexity. If you need to check every combination in a set of size 250, you will need at least 2^250 operations, which is on the order of 10^75.
Unless you have some data about the values of the set, which lets you skip most of the combinations, or you limit yourself to combinations of 2 or 3, it's impossible to be done in sensible time.
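A minimal sketch of the restricted search mentioned above, limited to small combination sizes. The function name feasible_combinations and the five sample values are made up for illustration; the question's full vectors are not shown.

```python
from itertools import combinations


def feasible_combinations(a, b, max_size, a_limit=500, b_limit=-5000):
    """Yield index tuples whose A-sum is <= a_limit and B-sum is < b_limit."""
    indices = range(len(a))
    for size in range(1, max_size + 1):
        for combo in combinations(indices, size):
            if sum(a[i] for i in combo) <= a_limit and sum(b[i] for i in combo) < b_limit:
                yield combo


# Made-up values for illustration only.
A = [100, 1000, 254, 21, 90]
B = [-30, -100, -1254, -821, -4000]
result = list(feasible_combinations(A, B, max_size=3))
print(result)
```

With 250 elements there are 2,573,000 combinations of size 3, which is still manageable; the count grows quickly with each additional element allowed per combination.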

Interpolation on 4D data

I am trying to perform an interpolation/fit (preferably non-linear, but linear should also be fine) on 4D data. My data has a form of:
[a,b,c] = func(input)
obviously, func is unknown and ultimately data looks like (input, a, b, c):
0 -0.1253 0.0341 0.01060
35 -0.0985 0.0176 0.02060
50 -0.0315 -0.0533 0.1118
60 -0.0518 -0.0327 0.03020
80 0.2939 -0.0713 0.05670
100 0.3684 -0.0765 0.06740
I take observations at e.g. input = [0, 35, 50, 60, 80, 100] (0 being min and 100 being max; I take 6 samples in between min and max) and then I get corresponding a, b and c values (I understand that 6 sample points are a bad design of experiment so I will extend it in future).
I am trying to guess the value of a, b and c at say input = 19? Any pointers?
How to estimate goodness of fit in such scenario?

This is not 4D interpolation; it is 1D interpolation done three times. You just interpolate interp1([0 35],[-0.1253 -0.0985],19), and the same for b and c (in general, interp1(input,a,19)).
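The same idea as a short Python sketch, using the six observations from the question. The helper interp_linear is a hypothetical hand-written piecewise-linear interpolator, meant to play the role that interp1 plays in MATLAB with its default linear method.

```python
from bisect import bisect_right

inputs = [0, 35, 50, 60, 80, 100]
a = [-0.1253, -0.0985, -0.0315, -0.0518, 0.2939, 0.3684]
b = [0.0341, 0.0176, -0.0533, -0.0327, -0.0713, -0.0765]
c = [0.01060, 0.02060, 0.1118, 0.03020, 0.05670, 0.06740]


def interp_linear(xs, ys, x):
    """Piecewise-linear interpolation; xs must be sorted and x within [xs[0], xs[-1]]."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the sampled range")
    hi = min(bisect_right(xs, x), len(xs) - 1)
    lo = hi - 1
    t = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + t * (ys[hi] - ys[lo])


# One independent 1D interpolation per output column.
estimate = [interp_linear(inputs, col, 19) for col in (a, b, c)]
print(estimate)
```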
Note that for the most basic 1D interpolation in a mesh grid (not what you have), you need 2 data points in general. For the most basic 2D interpolation, you need 4 data points. For 3D interpolation, 8 minimum, 4D, 16.... (2^d in general).
Also note that 1D interpolation uses 2 "dims". Because you use one to guide the interpolation, the other one is interpolated. General, with [v,a,b,c] data you would use 3D interpolation.
All that said, that is not your situation. You have scattered data, not a grid, so the problem becomes considerably more complicated.
In case you can generate a few more points (not necessarily 16) you can use the function griddatan for interpolating scattered data. Note that you cannot just say "give me [a,b,c] for input=19"; there could be an infinite number of [a,b,c] combinations that satisfy that condition. In any case, you always need to give dim-1 sample coordinates and get the last one interpolated. A word of advice: this function is very expensive in both computation and memory. Do not use it on large data sets, because it may crash your PC.
If you want to find a set of parameters that gives input=19, then you are getting into a more complicated area. You want to find x=[a,b,c] such that f(x)=input, which can be posed as a minimisation.
In math terms:
\hat{x} = \arg\min_x |f(x) - input|^2
This is a harder problem and arguably more a mathematics question than a programming one. Perhaps an N-D B-spline fit of your data would be a good f.

Arbitrary distribution -> Uniform distribution (Probability Integral Transform?)

I have 500,000 values for a variable derived from financial markets. Specifically, this variable represents distance from the mean (in standard deviations). This variable has an arbitrary distribution. I need a formula that will allow me to select a range around any value of this variable such that an equal (or close to it) number of data points fall within that range.
This will allow me to then analyze all of the data points within a specific range and to treat them as "similar situations to the input."
From what I understand, this means that I need to convert it from arbitrary distribution to uniform distribution. I have read (but barely understood) that what I am looking for is called "probability integral transform."
Can anyone assist me with some code (Matlab preferred, but it doesn't really matter) to help me accomplish this?

Here's something I put together quickly. It's not polished and not perfect, but it does what you want to do.
clear
randList=[randn(1e4,1);2*randn(1e4,1)+5];
[xCdf,xList]=ksdensity(randList,'npoints',5e3,'function','cdf');
xRange=getInterval(5,xList,xCdf,0.1);
and the function getInterval is
function out=getInterval(yPoint,xList,xCdf,areaFraction)
yCdf=interp1(xList,xCdf,yPoint);
yCdfRange=[-areaFraction/2, areaFraction/2]+yCdf;
out=interp1(xCdf,xList,yCdfRange);
Explanation:
The CDF of the random distribution is shown below by the line in blue. You provide a point (here 5 in the input to getInterval) about which you want a range that gives you 10% of the area (input 0.1 to getInterval). The chosen point is marked by the red cross and the interval is marked by the lines in green. You can get the corresponding points from the original list that lie within this interval as
newList=randList(randList>=xRange(1) & randList<=xRange(2));
You'll find that, on average, the number of points in this example is ~2000, which is 10% of numel(randList)
numel(newList)
ans =
2045
NOTE:
Please note that this was done quickly and I haven't made any checks to see if the chosen point is outside the range or if yCdfRange falls outside [0 1], in which case interp1 will return a NaN. This is fairly straightforward to implement, and I'll leave that to you.
Also, ksdensity is very CPU intensive. I wouldn't recommend increasing npoints to more than 1e4. I assume you're only working with a fixed list (i.e., you have a list of 5e5 points that you've obtained somehow and now you're just running tests/analyzing it). In that case, you can run ksdensity once and save the result.
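A rough Python sketch of the same idea, using the empirical CDF (ranks in the sorted data) instead of a kernel density estimate. This is a swapped-in technique, not a port of the MATLAB code; interval_around is a hypothetical helper and the test data only imitates the two-component mixture above.

```python
import random
from bisect import bisect_left, bisect_right


def interval_around(sorted_data, point, fraction):
    """Return (low, high) holding roughly `fraction` of the data, centred in rank on `point`."""
    n = len(sorted_data)
    rank = bisect_left(sorted_data, point)   # empirical CDF of `point`, times n
    half = int(fraction * n / 2)
    lo = max(rank - half, 0)                 # clamp at the ends of the data
    hi = min(rank + half, n) - 1
    return sorted_data[lo], sorted_data[hi]


rng = random.Random(0)
data = sorted([rng.gauss(0, 1) for _ in range(10_000)]
              + [rng.gauss(5, 2) for _ in range(10_000)])
low, high = interval_around(data, 5.0, 0.1)
inside = bisect_right(data, high) - bisect_left(data, low)
print(low, high, inside)
```

Sorting once and reusing the sorted list keeps each lookup cheap, in the same spirit as running ksdensity once and saving the result. Near the extremes of the data the interval is clamped, so it holds fewer points there.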

I do not speak Matlab, but you need to find quantiles in your data. This is Mathematica code which would do this:
In[88]:= data = RandomVariate[SkewNormalDistribution[0, 1, 2], 10^4];
Compute quantile points:
In[91]:= q10 = Quantile[data, Range[0, 10]/10];
Now form pairs of consecutive quantiles:
In[92]:= intervals = Partition[q10, 2, 1];
In[93]:= intervals
Out[93]= {{-1.397, -0.136989}, {-0.136989, 0.123689}, {0.123689,
0.312232}, {0.312232, 0.478551}, {0.478551, 0.652482}, {0.652482,
0.829642}, {0.829642, 1.02801}, {1.02801, 1.27609}, {1.27609,
1.6237}, {1.6237, 4.04219}}
Verify that the splitting points separate data nearly evenly:
In[94]:= Table[Count[data, x_ /; i[[1]] <= x < i[[2]]], {i, intervals}]
Out[94]= {999, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000}
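For readers without Mathematica, roughly the same decile split can be sketched in Python with the standard library. The normal test data here is an assumption standing in for the skew-normal sample above, and the bin convention (half-open, with the maximum added to the last bin) is my own choice.

```python
import random
import statistics

rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(10_000)]

# Nine cut points that split the data into ten equally populated bins.
cuts = statistics.quantiles(data, n=10)
edges = [min(data)] + cuts + [max(data)]
intervals = list(zip(edges[:-1], edges[1:]))

# Half-open bins [lo, hi), except the last one, which also includes the maximum.
counts = [sum(lo <= x < hi for x in data) for lo, hi in intervals]
counts[-1] += sum(x == edges[-1] for x in data)
print(counts)
```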

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicate the number of years. 1 row would contain the production values corresponding to the 20 years. Other 99 rows for matrix 2, 3 and 4 are just the different realizations (or simulation runs). So basically the other 99 rows for matrix 2,3 and 4 are repeat cases (but not with exact values because of random numbers).
Consider Matrix_1 as the reference truth (or base case ). Now I want to compare the other 3 matrices with Matrix_1 to see which one among those three matrices (each with 100 repeats) compares best, or closely imitates, with Matrix_1.
How can this be done in Matlab?
I know, manually, that we use confidence interval (CI) by plotting the mean of Matrix_1, and drawing each distribution of mean of Matrix_2, mean of Matrix_3 and mean of Matrix_4. The largest CI among matrix 2, 3 and 4 which contains the reference truth (or mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of my CI, from the three methods, includes the mean ( = 3.454992884900722e+008) inside it. So do we still consider p-value to choose the best result?

If I understand correctly, the calculation in MATLAB is pretty straightforward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do a t-test comparing your vectors 2, 3 and 4 against a normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
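To make explicit what a one-sample t-test computes, here is a small Python sketch of the test statistic only (no p-value). The function one_sample_t and the numbers are hypothetical, chosen purely for illustration; they are not the data from the question.

```python
import math
import statistics


def one_sample_t(values, reference):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == reference."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)   # standard error of the mean
    return (mean - reference) / sem, n - 1


# Hypothetical realization means for one method, and a hypothetical reference value.
method_means = [3.9, 4.1, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9]
t_stat, dof = one_sample_t(method_means, reference=3.45)
print(t_stat, dof)
```

A large absolute t statistic means the sample mean lies many standard errors away from the reference value, which is what drives the very small p-values reported in the question.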

EDIT : I misinterpreted your question. See the answer of Yuk and following comments. My answer is what you need if you want to compare distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals, it's not too difficult to guess the standard deviation of your results. This is a measure of the "spread" of your results. Now the error on your mean is calculated as the standard deviation of your results divided by the square root of the number of observations. And the confidence interval is calculated by multiplying that standard error by approximately 2.
This confidence interval contains the true mean in 95% of the cases. So if the true mean is exactly at the border of that interval, the p-value is 0.05; the further away the mean, the lower the p-value. Loosely, this can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with a mean as in matrix 1. If you look at your p-values, these chances can be said to be non-existent.
So you see that when the number of values gets high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com
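The shrinking interval can be seen in a short Python sketch. It assumes the standard error is taken as s / sqrt(n) and uses the rough factor of 2 mentioned above; approx_ci95 is a hypothetical helper and the data is simulated.

```python
import math
import random
import statistics


def approx_ci95(values):
    """Approximate 95% interval for the mean: mean +/- 2 * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    half_width = 2 * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width


rng = random.Random(0)
for n in (10, 100, 1000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    low, high = approx_ci95(sample)
    # The interval width shrinks roughly like 1 / sqrt(n).
    print(n, round(high - low, 3))
```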
Your question and your method aren't really clear :
Is the distribution equal in all columns? This is important, as two distributions can have the same mean, but differ significantly :
Is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution where sd(mean) = sd(observations)/sqrt(number of observations). Saves you quite some work - if the distributions are alike!
Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a 2-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
On a side note: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, e.g. Bonferroni or Dunn-Sidak.

Is my subset sum algorithm of polynomial time?

I came up with a new algorithm to solve the subset sum problem, and I think it's in polynomial time. Tell me I'm either wrong or a total genius.
Quick starter facts:
Subset sum problem is an NP-complete problem. Solving it in polynomial time means that P = NP. The number of subsets in a set of length N, is 2^N.
On the more useful hand, the number of unique subsets in the length N is at maximum the sum of the whole set minus the smallest element, or, the range of sums that any subset can possibly produce is between the sum of all the negative elements and the sum of all the positive elements, since no sum can possibly be bigger or smaller than all the positive or negative sums, which grow at a linear rate when we add extra elements.
What this means is that as N increases, the number of duplicate subsets increases exponentially, and the number of unique, useful subsets increases only linearly. If an algorithm could be devised that could remove the duplicate subsets at the earliest possible opportunity, we would run in polynomial time. A quick example is easily taken from binary. From only the numbers that are powers of two, we can create unique subsets for any integral value. As such, any subset involving any other number (if we had all powers of two) is a duplicate and a waste. By not computing them and their derivatives, we can save virtually all the running time of the algorithm, since the numbers which are powers of two are logarithmically occurring compared to any integer.
As such, I propose a simple algorithm that will remove these duplicates and save having to compute them and all their derivatives.
To begin with, we'll sort the set which is only O(N log N), and split it into two halves, positive and negative. The procedure for the negative numbers is identical, so I'll only outline the positive numbers (the set now means just the positive half, just for clarification).
Imagine an array indexed by sum, which has entries for all the possible result sums of the positive side (which is only linear, remember). When we add an entry, the value is the entries in the subset. So like, array[3] = { 1, 2 }.
In general, we now move to enumerate all subsets in the set. We do this by starting with the subsets of one length, then adding to them. When we have all the unique subsets, they form an array, and we simply iterate them in the fashion used in Horowitz/Sahni.
Now we start with the "first generation" values. That is, if there were no duplicate numbers in the original data set, there are guaranteed to be no duplicates in these values. That is, all single-value subsets, and all length of the set minus one length subsets. These can easily be generated by summing the set, and subtracting each element in turn. In addition the set itself is a valid first generation sum and subset, as well as each individual element of the subset.
Now we do the second generation values. We loop through each value in the array and for each unique subset, if it doesn't have it, we add it and compute the new unique subset. If we have a duplicate, fun occurs. We add it to a collision list. When we come to add new subsets, we check if they're on the collision list. We key by the less desirable (normally larger, but can be arbitrary) equal subset. When we come to add to subsets, if we would generate a collision, we simply do nothing. When we come to add the more desirable subset, it misses the check and adds, generating the common subset. Then we just repeat for the other generations.
By removing duplicate subsets in this manner, we don't have to keep combining the duplicates with the rest of the set, nor keep checking them for collisions, nor sum them. Most importantly, by not creating new subsets that are non-unique, we're not generating new subsets from them, which can, I believe, turn the algorithm from NP to P, since the growth of subsets is no longer exponential- we discard the vast majority of them before they can "reproduce" in the next generation and create more subsets by being combined with the other non-duplicate subsets.
I don't think I've explained this too well. I have pictures... they're in my head. The important thing is that by discarding duplicate subsets, you could remove virtually all of the complexity.
For example, imagine (because I'm doing this example by hand) a simple dataset that goes -7 to 7 (not zero) for which we aim at zero. Sort and split, so we're left with (1, 2, 3, 4, 5, 6, 7). The sum is 28. But 2^7 is 128. So 128 subsets fit in the range 1 .. 28, meaning that we know in advance that 100 sets are duplicates. If we had 8, then we'd only have 36 slots, but now 256 subsets. So you can easily see that the number of dupes would now be 220, greater than double what it was before.
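The counting argument can be checked by brute force for this small case. Note that the sketch below excludes the empty subset, so it counts 127 subsets rather than the 128 quoted above.

```python
from itertools import combinations

values = [1, 2, 3, 4, 5, 6, 7]

# Sum of every non-empty subset (2^7 - 1 = 127 of them).
subset_sums = [sum(c) for r in range(1, len(values) + 1) for c in combinations(values, r)]
distinct = set(subset_sums)

# Number of subsets, number of distinct sums, number of "duplicate" sums.
print(len(subset_sums), len(distinct), len(subset_sums) - len(distinct))
```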
In this case, the first generation values are 1, 2, 3, 4, 5, 6, 7, 28, 27, 26, 25, 24, 23, 22, 21, and we map them to their constituent components, so
1 = { 1 }
2 = { 2 }
...
28 = { 1, 2, 3, 4, 5, 6, 7 }
27 = { 2, 3, 4, 5, 6, 7 }
26 = { 1, 3, 4, 5, 6, 7 }
...
21 = { 1, 2, 3, 4, 5, 6 }
Now to generate the new subsets, we take each subset in turn and add it to each other subset, unless they have a mutual subsubset, e.g. 28 and 27 have a hueg mutual subsubset. So when we take 1 and we add it to 2, we get 3 = { 1, 2 } but owait! It's already in the array. What this means is that we now don't add 1 to any subset that already has 2 in it, and vice versa, because that's a duplicate on 3's subsets.
Now we have
1 = { 1 }
2 = { 2 }
// Didn't add 1 to 2 to get 3 because that's a dupe
3 = { 3 } // Add 1 to 3, amagad, get a duplicate. Repeat the process.
4 = { 4 } // And again.
...
8 = { 1, 7 }
21? Already has 1 in.
...
27? We already have 28
Now we add 2 to all.
1? Existing duplicate
3? Get a new duplicate
...
9 = { 2, 7 }
10 = { 1, 2, 7 }
21? Already has 2 in
...
26? Already have 28
27? Got 2 in already.
3?
1? Existing dupe
2? Existing dupe
4? New duplicate
5? New duplicate
6? New duplicate
7? New duplicate
11 = { 1, 3, 7 }
12 = { 2, 3, 7 }
13 = { 1, 2, 3, 7 }
As you can see, even though I am still adding new subsets each time, the quantity is only going up linearly.
4?
1? Existing dupe
2? Existing dupe
3? Existing dupe
5? New duplicate
6? New duplicate
7? New duplicate
8? New duplicate
9? New duplicate
14 = {1, 2, 4, 7}
15 = {1, 3, 4, 7}
16 = {2, 3, 4, 7}
17 = {1, 2, 3, 4, 7}
5?
1,2,3,4 existing duplicate
6,7,8,9,10,11,12 new duplicate
18 = {1, 2, 3, 5, 7}
19 = {1, 2, 4, 5, 7}
20 = {1, 3, 4, 5, 7}
21 = new duplicate
Now we have every value in the range, so we stop, and add to our list 1-28. Repeat for negative numbers, iterate through lists.
Edit:
This algorithm is totally wrong in any case. Subsets which have duplicate sums are not duplicates for the purposes of which subsets can be spawned from them, because they are arrived at differently- i.e., they cannot be folded.

This does not prove P = NP.
You have failed to consider the possibility where the positive numbers are: 1, 2, 4, 8, 16, etc... and so there will be no duplicates when you sum subsets, so it will run in O(2^N) time in this case.
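A brute-force check of this counterexample, as a small sketch: with powers of two, every subset has a different sum, so there is nothing to discard.

```python
from itertools import combinations

powers = [1, 2, 4, 8, 16, 32]

# Sums of all 2^6 subsets, including the empty one.
sums = [sum(c) for r in range(len(powers) + 1) for c in combinations(powers, r)]

# Every subset has its own sum, so no subset can be dropped as a duplicate.
print(len(sums), len(set(sums)))
```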
You can treat this as a special case, but the algorithm is still not polynomial for other similar cases. This assumption that you made is where you move away from the NP-complete version of subset sum to solving only easy (polynomial time) problems:
[assume the sum of the positive numbers grows] at a linear rate when we add extra elements.
Here you are effectively assuming that P (i.e. number of bits required to state the problem) is smaller than N. Quote from Wikipedia:
Thus, the problem is most difficult if N and P are of the same order.
If you assume that N and P are of the same order then you can't assume that the sum grows linearly indefinitely as you add more elements. As you add more elements to your set those elements also need to get larger to ensure that problem remains hard to solve.
If P (the number of place values) is a small fixed number, then there are dynamic programming algorithms that can solve it exactly.
You have rediscovered one of these algorithms. It's a nice piece of work but it isn't something new and it doesn't prove P = NP. Sorry!
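A minimal sketch of such a dynamic programming approach, keeping one representative subset per reachable sum, which is the "discard duplicates" idea from the question. The function reachable_sums is a hypothetical illustration, not the asker's exact procedure.

```python
def reachable_sums(numbers):
    """Return a dict mapping each reachable sum to one subset (as a tuple) that produces it."""
    best = {0: ()}
    for x in numbers:
        # Snapshot the current sums so x is used at most once per subset.
        for s, subset in list(best.items()):
            best.setdefault(s + x, subset + (x,))
    return best


table = reachable_sums([1, 2, 3, 4, 5, 6, 7])
print(len(table), table[13])
```

The work per element is proportional to the number of distinct sums seen so far, so the running time depends on the magnitude of the numbers, not only on how many there are. For inputs like 1, 2, 4, 8, ... the table doubles with every element.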

Dead MG,
It has been almost half a year since you posted but I will answer anyway.
Mark Byers wrote most of what should be written.
The algorithm is known.
Such algorithms are known as generating functions algorithms or simply as dynamic programming algorithms.
Your algorithm has a very important feature, the so-called pseudopolynomial complexity.
Traditional complexity is a function of the size of the problem. In terms of traditional complexity your algorithm has O(2^n) pessimistic complexity (that is for the numbers 1,2, 4,... as was mentioned earlier )
The complexity of your algorithm can alternatively be expressed as a function of the size of the problem and the size of some numbers in the problem. For your algorithm it would be something like O(nw) where w is the number of distinct sums.
This is pseudopolynomial complexity. It is a VERY important feature. Such algorithms can solve lots of real-world problem instances, despite the problem's complexity class.
The Horowitz and Sahni algorithm has pessimistic complexity O(2^(N/2)). This is not two times better than your algorithm but lots more: 2^(N/2) times better. What Greg probably meant was that the Horowitz and Sahni algorithm can solve instances of the problem that are twice as big (having twice as many numbers in the subset sum).
That's true in theory, but in practice Horowitz and Sahni can solve (on home computers) instances with about 60 numbers, while an algorithm similar to yours can handle even instances with 1000 numbers (provided that the numbers aren't too big themselves).
In fact the two kinds of algorithm, yours and Horowitz and Sahni's, can even be mixed. Such a solution has both pseudopolynomial complexity and pessimistic complexity of O(2^(n/2)).
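For reference, here is a short sketch of the meet-in-the-middle idea behind Horowitz and Sahni. It is a simplification: the original sorts the two lists of half-sums and walks them in opposite directions, while this version uses a set lookup instead. The function names are hypothetical, and the empty subset is allowed, so a target of 0 is always reported as reachable.

```python
from itertools import combinations


def all_subset_sums(numbers):
    """Set of sums over every subset of `numbers`, including the empty subset (sum 0)."""
    return {sum(c) for r in range(len(numbers) + 1) for c in combinations(numbers, r)}


def has_subset_sum(numbers, target):
    """Meet in the middle: enumerate each half separately, then match the two sets of sums."""
    half = len(numbers) // 2
    left = all_subset_sums(numbers[:half])
    right = all_subset_sums(numbers[half:])
    return any(target - s in right for s in left)


print(has_subset_sum([3, 34, 4, 12, 5, 2], 9))
```

Each half contributes at most 2^(N/2) sums, which is where the O(2^(N/2)) bound comes from.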
A trained computer scientist can construct an algorithm such as yours by means of generating functions theory.
Both trained and untrained can think it up the way you did.
Do not necessarily think in terms "is it known?". It should be important to you that you can invent such algorithm on your own. It means that you probably can invent other important algorithms on your own and someday one that isn't known maybe. Knowing current progress in the field and what's in the literature helps. Otherwise you will keep on reinventing the wheel.
When I was way back in high school I reinvented Dijkstra algorithm. My version had terrible complexity because I didn't know anything about data structures. Anyway, I am still proud of myself.
If you are still studying pay attention to generating functions theory.
You may also want to check out on wiki:
pseudopolynomial time
weakly NP-complete
strongly NP-complete
generating functions
Megli

What this means is that as N increases, the number of duplicate subsets increases exponentially, and the number of unique, useful subsets increases only linearly.
Not necessarily - the number of duplicate subset sums is also determined by the value of the number closest to zero in the set (the greater the minimum distance to zero, the fewer the duplicate subset sums for the set).
In general, we now move to enumerate all subsets in the set.
Unfortunately, enumerating all the sums of the subsets of the set requires performing an exponential number of addition operations (2^7 or 128 in your example). Otherwise, how would the algorithm determine what the unique sums happen to be? So, although the steps that follow the first step could very well have a polynomial running time, the algorithm as a whole has exponential complexity (because an algorithm is only as fast as its slowest part).
Incidentally, the best known algorithm for solving the subset sum problem (Horowitz and Sahni, 1974) has O(2^(N/2)) complexity - which makes it about twice as fast as this algorithm.
