Even distribution of target values with PySpark sampling

I want to split the data into train and test sets with PySpark. My target column is called "ActionName". The issue is that about 76% of the rows have the value 1 and only about 24% have the value 0. I want to create a sample dataset in which ones and zeroes are evenly distributed. I have tried the following:
df.groupBy("ActionName").count().show()
+----------+------+
|ActionName| count|
+----------+------+
| 1|566435|
| 0|175905|
+----------+------+
train = df.sampleBy("ActionName", fractions={0: 0.5, 1: 0.5}, seed=700000)
train.groupBy("ActionName").count().show()
+----------+------+
|ActionName| count|
+----------+------+
| 1|283282|
| 0| 88264|
+----------+------+
The sample has essentially the same class distribution as the original data.

You seem to think that the fractions argument controls the class proportions of the resulting dataframe, but this is not the case; it controls what fraction of each class's rows in the source dataframe is sampled. Given that you use fractions={0: 0.5, 1: 0.5}, it is no surprise that you end up with the same proportions: what you have actually asked for is to keep (approximately) half the rows of each class.
Assuming that you want to keep all your minority class (0) samples and only downsample the majority class so that you end up with a balanced dataset, you need:
train = df.sampleBy("ActionName", fractions={0: 1.0, 1: 0.31}, seed=700000)
where 0.31 ≈ 175905/566435.
Similarly, if you want to get a balanced dataset with half the samples of the minority class, you should use
train = df.sampleBy("ActionName", fractions={0: 0.5, 1: 0.155}, seed=700000)
where 0.155 = 0.31/2.
You get the idea...
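The arithmetic above generalizes to any class counts. Here is a minimal sketch in plain Python; `balanced_fractions` and its `keep` parameter are hypothetical names introduced for illustration, not part of PySpark. The returned dict is what you would pass as `fractions` to `sampleBy`.

```python
def balanced_fractions(counts, keep=1.0):
    """Return per-class sampling fractions that bring every class down to
    `keep` times the size of the smallest class.

    `counts` maps class label -> row count, e.g. the values shown by
    df.groupBy("ActionName").count(). Hypothetical helper, not a PySpark API.
    """
    if not 0.0 < keep <= 1.0:
        raise ValueError("keep must be in (0, 1]")
    target = min(counts.values()) * keep  # desired number of rows per class
    return {label: target / n for label, n in counts.items()}


counts = {1: 566435, 0: 175905}
print(balanced_fractions(counts))            # keep all of the minority class
print(balanced_fractions(counts, keep=0.5))  # keep half of the minority class
```

Since sampleBy decides row by row whether to keep a row, the resulting class counts will be close to, but not exactly, equal.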

Problem understanding loss function behavior using Flux.jl in Julia

First of all, I am new to neural networks (NNs).
As part of my PhD, I am trying to solve a problem with an NN.
For this, I have written a program that creates a data set made of
a collection of input vectors (each with 63 elements) and their corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # assemble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say) Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector are normalized to unity (they represent x,y,z coordinates), and the following 60 numbers are also normalized to unity and correspond to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs without errors; however, I noticed that as I train my model with an increasingly large data set (Nₜᵣ = 10, 15, 25, etc.), the loss seems to increase.
[Figure: plot of the losses, where y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.]
So, my main question:
Why is this happening? I cannot see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements of the training data set (input and output) are normalized to [-1,1].
I have not tried changing the activation functions.
I have not tried changing the optimization method.
Considerations: I need a training data set of nearly 10000 input vectors, so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training data set correctly? If every single data vector is made of 63 numbers, is it correct to group them in an array and then pile them into an Array{Array{Float64,1},1}? I have no experience using NNs and Flux. How could I build a data set of 10000 I/O vectors differently? Can this be the issue? (I am very inclined to this.)
Can this behavior be related to the chosen activation functions? (I am not inclined to this.)
Can this behavior be related to the optimization algorithm? (I am not inclined to this.)
Am I training my model wrong? Are the iterations of the loop really iterations, or are they epochs? I am struggling to put the distinction between "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your loss isn't normalized: the value you record is the sum of the per-sample costs over the whole data set, and since every term is non-negative, adding more data can only increase it. However, the cost per data point doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function, for example the mean squared cost.
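To illustrate the point about normalization, here is a small sketch in plain Python (not Flux). It assumes a dummy "model" that always predicts zeros and random 6-element targets in [-1, 1]; `squared_cost` mirrors the `squaredCost` from the question, and `total_and_mean_loss` is a hypothetical helper. The summed loss grows with the number of samples, while the per-sample mean stays on the same scale.

```python
import random


def squared_cost(pred, target):
    """Half the squared Euclidean distance, like squaredCost in the question."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))


def total_and_mean_loss(n_samples, seed=0):
    """Summed and per-sample loss for a dummy model that always predicts zeros."""
    rng = random.Random(seed)
    targets = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(n_samples)]
    zeros = [0.0] * 6
    per_sample = [squared_cost(zeros, t) for t in targets]
    total = sum(per_sample)
    return total, total / n_samples


for n in (10, 15, 25):
    total, mean = total_and_mean_loss(n)
    print(f"N={n:2d}  summed loss={total:.3f}  mean loss={mean:.3f}")
```

In the training loop from the question, the analogous change would be to record the mean of the per-sample losses rather than their sum.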

How can I use skewnorm to produce a distribution with the specified skew?

I am trying to produce a random distribution where I control the mean, SD, skewness and kurtosis.
I can solve the mean and SD with some simple maths after the distribution is produced.
Kurtosis I am leaving on the shelf for the moment because it just seems too hard.
Skewness is today's problem.
import numpy as np
from scipy import stats

def convert_to_alpha(s):
    # delta, then the shape parameter alpha, for a target skewness s
    d = (np.pi/2*((abs(s)**(2/3))/(abs(s)**(2/3)+((4-np.pi)/2)**(2/3))))**0.5
    a = d/((1-d**2)**.5)
    return a

for skewness_expected in (.5, .9, 1.3):
    alpha = convert_to_alpha(skewness_expected)
    r = stats.skewnorm.rvs(alpha, size=10000)
    print('Skewness expected:', skewness_expected)
    print('Skewness obtained:', stats.skew(r))
    print()
Skewness expected: 0.5
Skewness obtained: 0.47851348006629035
Skewness expected: 0.9
Skewness obtained: 0.8917020428586827
Skewness expected: 1.3
Skewness obtained: (1.2794406116842627+0.01780402125888404j)
I understand that the calculated skewness will generally not match the desired skewness - this is a random distribution, after all. But I am confused as to how I can get a distribution with a skewness > 1 without falling into complex number territory. The rvs method appears incapable of handling it, since the parameter alpha is an imaginary number whenever skewness > 1.
How can I fix it so that I can generate distributions with skewness > 1, but not have complex numbers creeping in?
[With credit to Warren Weckesser for pointing me at Wikipedia in order to write the convert_to_alpha function.]
I understand this thread is a year and a half old now, but I've run into this problem recently as well and it never seemed to get answered here. The further problem with converting between alpha from stats.skewnorm and the skewness statistic (excellent function to do that, by the way) is that doing so will also alter the measures of central tendency for the distribution, which was problematic for my needs.
I've developed this based on the F-distribution (https://en.wikipedia.org/wiki/F-distribution). The end result of a lot of work is this function, for which you specify the required mean, SD and skewness, and the desired sample size. I can share the work behind it if anyone wishes. The output SD and skew become a little rough at extreme settings, presumably because the F-distribution naturally sits around 1. It is also very problematic for skew values close to zero, in which case there would be no need for this function anyway.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def createSkewDist(mean, sd, skew, size):
    # calculate the degrees of freedom 1 required to obtain the specific skewness statistic, derived from simulations
    loglog_slope = -2.211897875506251
    loglog_intercept = 1.002555437670879
    df2 = 500
    df1 = 10**(loglog_slope*np.log10(abs(skew)) + loglog_intercept)
    # sample from F distribution
    fsample = np.sort(stats.f(df1, df2).rvs(size=size))
    # adjust the variance by scaling the distance from each point to the distribution mean by a constant, derived from simulations
    k1_slope = 0.5670830069364579
    k1_intercept = -0.09239985798819927
    k2_slope = 0.5823114978219056
    k2_intercept = -0.11748300123471256
    scaling_slope = abs(skew)*k1_slope + k1_intercept
    scaling_intercept = abs(skew)*k2_slope + k2_intercept
    scale_factor = (sd - scaling_intercept)/scaling_slope
    new_dist = (fsample - np.mean(fsample))*scale_factor + fsample
    # flip the distribution if specified skew is negative
    if skew < 0:
        new_dist = np.mean(new_dist) - new_dist
    # adjust the distribution mean to the specified value
    final_dist = new_dist + (mean - np.mean(new_dist))
    return final_dist
'''EXAMPLE'''
desired_mean = 497.68
desired_skew = -1.75
desired_sd = 77.24
final_dist = createSkewDist(mean=desired_mean, sd=desired_sd, skew=desired_skew, size=1000000)
# inspect the plots & moments, try random sample
fig, ax = plt.subplots(figsize=(12,7))
sns.distplot(final_dist, hist=True, ax=ax, color='green', label='generated distribution')
sns.distplot(np.random.choice(final_dist, size=100), hist=True, ax=ax, color='red', hist_kws={'alpha':.2}, label='sample n=100')
ax.legend()
print('Input mean: ', desired_mean)
print('Result mean: ', np.mean(final_dist),'\n')
print('Input SD: ', desired_sd)
print('Result SD: ', np.std(final_dist),'\n')
print('Input skew: ', desired_skew)
print('Result skew: ', stats.skew(final_dist))
Input mean: 497.68
Result mean: 497.6799999999999
Input SD: 77.24
Result SD: 71.69030764848961
Input skew: -1.75
Result skew: -1.6724486459469905
The shape parameter of the skew-normal distribution is not the skewness of the distribution. Check out the Wikipedia page for the skew normal distribution. The formulas in the table on the right give the expressions for the mean, variance, skewness, etc., in terms of the parameters. You can get these values from the skewnorm object with the stats() method.
For example, here's the skewness of the distribution with shape parameter 2:
In [46]: from scipy.stats import skewnorm, skew
In [47]: skewnorm.stats(2, moments='s')
Out[47]: array(0.45382556395938217)
Generate a couple samples and find the sample skewness:
In [48]: r = skewnorm.rvs(2, size=10000000)
In [49]: skew(r)
Out[49]: 0.4533209955299838
In [50]: r = skewnorm.rvs(2, size=10000000)
In [51]: skew(r)
Out[51]: 0.4536583726840712
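The formulas referred to above also explain the complex numbers in the question: the skewness of a skew-normal distribution is bounded in magnitude by roughly 0.995, so no real shape parameter can produce a skewness of 1.3. The following is a sketch in plain Python using the skewness formula from the Wikipedia table; the function names and the `MAX_SKEW` constant are hypothetical, chosen for illustration. The inverse raises an error for unreachable targets instead of returning a complex value.

```python
import math

# Coefficient (4 - pi) / 2 from the skew-normal skewness formula.
_C = (4 - math.pi) / 2


def skewness_from_alpha(alpha):
    """Population skewness of a skew-normal distribution with shape `alpha`."""
    delta = alpha / math.sqrt(1 + alpha ** 2)
    m = delta * math.sqrt(2 / math.pi)
    return _C * m ** 3 / (1 - m ** 2) ** 1.5


# Limit of the skewness as alpha -> infinity (delta -> 1).
MAX_SKEW = _C * (2 / math.pi) ** 1.5 / (1 - 2 / math.pi) ** 1.5


def alpha_from_skewness(skew):
    """Invert skewness_from_alpha; raise instead of silently going complex."""
    if abs(skew) >= MAX_SKEW:
        raise ValueError(
            f"|skewness| must be below {MAX_SKEW:.4f} for a skew-normal distribution"
        )
    s23 = abs(skew) ** (2 / 3)
    delta = math.sqrt(math.pi / 2 * s23 / (s23 + _C ** (2 / 3)))
    alpha = delta / math.sqrt(1 - delta ** 2)
    return math.copysign(alpha, skew)


print(MAX_SKEW)
print(alpha_from_skewness(0.5), alpha_from_skewness(0.9))
```

For a target skewness above this bound, a different family of distributions is needed; the skew-normal cannot be made to fit.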

How to setup fitensemble for binary imbalanced datasets?

I've been trying to test MATLAB's ensemble methods with a randomly generated imbalanced dataset, and no matter how I set the prior/cost/weight parameters, the method never predicts close to the label ratio.
Below is an example of the tests I did.
prob = 0.9; %set label ratio to 90% 1 and 10% 0
y = (rand(100,1) < prob);
X = rand(100,3); %generate random training data with three features
X_test = rand(100,3); %generate random test data
%A few parameter sets I've tested
B = TreeBagger(100,X,y);
B2 = TreeBagger(100,X,y,'Prior','Empirical');
B3 = TreeBagger(100,X,y,'Cost',[0,9;1,0]);
B4 = TreeBagger(100,X,y,'Cost',[0,1;9,0]);
B5 = fitensemble(X,y,'RUSBoost', 20, 'Tree', 'Prior', 'Empirical');
Here I ran the trained classifiers on random test data. My assumption is that since the classifier is trained on random data, it should on average predict close to the dataset ratio (1/9) if it takes the prior into account. But each of the classifiers predicted 98-100% in favor of '1' instead of the ~90% that I am looking for.
l1 = predict(B,X_test);
l2 = predict(B2,X_test);
l3 = predict(B3,X_test);
l4 = predict(B4,X_test);
l5 = predict(B5,X_test);
How do I get the ensemble method to take the prior into account? Or is there a fundamental misunderstanding on my part?
I don't think it can work the way you expect.
That's because, as I understand it, your training and test data are random. So how should your classifier find any relation between the features and the label?
Let's take accuracy as the measure and work through an example.
class A: 900 data rows.
class B: 100 data rows.
Classify 100% as A:
0.9/(0.1+0.9) = 0.9
gets you 90% accuracy.
If your classifier does something different, i.e. tries to classify some data rows as B, then, with no information in the features, it will by chance misclassify about 9 A rows for every B row it gets right.
Let's say 20 B data rows are correctly classified; you will then get around 180 wrongly classified A data rows:
B: 20 correct, 80 incorrect
A: 720 correct, 180 incorrect
740/(740+260) = 0.74
Accuracy goes down to 74%, and that's not something your classification algorithm wants.
Long story short: your classifier will always tend to classify almost 100% as class A if there is no information in your data.
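The two accuracy figures in this example can be checked with a few lines of plain Python. This is only a restatement of the arithmetic above; the variable names are made up for illustration, and the 9-to-1 misclassification rate is the assumption that the features carry no information.

```python
def accuracy(correct_a, correct_b, total):
    """Fraction of rows classified correctly."""
    return (correct_a + correct_b) / total


n_a, n_b = 900, 100
total = n_a + n_b

# Strategy 1: predict class A for every row.
always_a = accuracy(n_a, 0, total)

# Strategy 2: label rows as B without any signal until 20 B rows are caught.
# With a 9:1 class ratio, about 9 A rows get mislabeled per correct B row.
caught_b = 20
wrong_a = caught_b * n_a // n_b
guessing = accuracy(n_a - wrong_a, caught_b, total)

print(always_a, guessing)
```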

Spark: How to run logistic regression using only some features from LabeledPoint?

I have a LabeledPoint RDD on which I want to run logistic regression:
Data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] =
MapPartitionsRDD[3335] at map at <console>:44
using code:
val splits = Data.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)
My problem is that I don't want to use all of the features from LabeledPoint, but only some of them. I've got a list of features that I want to use, for example:
LoF=List(223244,334453...
How can I get only the features that I want to use from LabeledPoint, or select them in logistic regression?
Feature selection allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.
One way to do what you are seeking is using the ElementwiseProduct.
ElementwiseProduct multiplies each input vector by a provided “weight” vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v and transforming vector, w, to yield a result vector.
So if we set the weight of the features we want to keep to 1.0 and the weight of all others to 0.0, the element-wise product of the original vector with this 0-1 weight vector zeroes out the unwanted features and leaves the selected ones unchanged:
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Creating dummy LabeledPoint RDD
val data = sc.parallelize(Array(LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0,5.0,1.0)), LabeledPoint(1.0,Vectors.dense(4.0, 5.0, 6.0,1.0,2.0)),LabeledPoint(0.0,Vectors.dense(4.0, 2.0, 3.0,0.0,2.0))))
data.toDF.show
// +-----+--------------------+
// |label| features|
// +-----+--------------------+
// | 1.0|[1.0,0.0,3.0,5.0,...|
// | 1.0|[4.0,5.0,6.0,1.0,...|
// | 0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+
// You'll need to know how many features you have, I have used 5 for the example
val numFeatures = 5
// The indices represent the features we want to keep
// Note: indices start at 0, so here you are actually keeping the 4th and 5th features.
val indices = List(3, 4).toArray
// Now we can create our weights vectors
val weights = Array.fill[Double](indices.size)(1)
// Create the sparse vector of the features we need to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)
// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)
// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label,transformer.transform(x.features).toSparse))
transformedData.toDF.show
// +-----+-------------------+
// |label| features|
// +-----+-------------------+
// | 1.0|(5,[3,4],[5.0,1.0])|
// | 1.0|(5,[3,4],[1.0,2.0])|
// | 0.0| (5,[4],[2.0])|
// +-----+-------------------+
Note:
I used the sparse vector representation to save space, so the resulting features are sparse vectors.
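The masking idea is independent of Spark. Here is a minimal sketch in plain Python, where `mask_features` is a hypothetical helper rather than a Spark API; it applies the same 0-1 weight vector to the three dummy rows from the example.

```python
def mask_features(features, keep_indices):
    """Zero out every feature whose index is not in keep_indices, i.e. take
    the element-wise product with a 0/1 weight vector."""
    keep = set(keep_indices)
    weights = [1.0 if i in keep else 0.0 for i in range(len(features))]
    return [f * w for f, w in zip(features, weights)]


rows = [
    [1.0, 0.0, 3.0, 5.0, 1.0],
    [4.0, 5.0, 6.0, 1.0, 2.0],
    [4.0, 2.0, 3.0, 0.0, 2.0],
]
for row in rows:
    print(mask_features(row, [3, 4]))
```

Note that masking keeps the vector dimension unchanged: the zeroed features are still present, they just contribute nothing to the model.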

Multi-class regression in nolearn?

I'm trying to build a Neural Network using nolearn that can do regression on multiple classes.
For example:
net = NeuralNet(layers=layers_s,
input_shape=(None, 2048),
l1_num_units=8000,
l2_num_units=4000,
l3_num_units=2000,
l4_num_units=1000,
d1_p = 0.25,
d2_p = 0.25,
d3_p = 0.25,
d4_p = 0.1,
output_num_units=noutput,
output_nonlinearity=None,
regression=True,
objective_loss_function=lasagne.objectives.squared_error,
update_learning_rate=theano.shared(float32(0.1)),
update_momentum=theano.shared(float32(0.8)),
on_epoch_finished=[
AdjustVariable('update_learning_rate', start=0.1, stop=0.001),
AdjustVariable('update_momentum', start=0.8, stop=0.999),
EarlyStopping(patience=200),
],
verbose=1,
max_epochs=1000)
noutput is the number of classes for which I want to do regression; if I set this to 1, everything works. When I use 26 (the number of classes here) as output_num_units, I get a Theano dimension error (dimension mismatch in args to gemm (128,1000)x(1000,26)->(128,1)).
The Y labels are continuous variables, each corresponding to a class. I tried to reshape the Y labels to (rows,classes), but this means I have to give a lot of the Y labels a value of 0 (because the value for that class is unknown). Is there any way to do this without setting some y_labels to 0?
If you want to do multiclass (or multilabel) regression with 26 classes, your output must not have shape (1082,), but (1082, 26). To preprocess your output, you can use sklearn.preprocessing.label_binarize, which will transform your 1D output into a 2D output.
Also, your output nonlinearity should be a softmax function, so that the rows of your output sum to 1.