I'm trying to use k-fold cross-validation to tune a regression tree generated in pyspark. However, from what I've seen so far, it is not possible to combine pyspark's CrossValidator with pyspark's DecisionTree.trainRegressor. Here is the relevant code.
(trainingData, testData) = data.randomSplit([0.7, 0.3])
model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo={}, impurity='variance', maxDepth=5, maxBins=32)
How do I then apply the k-fold cross-validation to the regressor?
That's correct: CrossValidator only accepts the DataFrame-based estimators from pyspark.ml, not the RDD-based pyspark.mllib API. Switch from DecisionTree.trainRegressor to DecisionTreeRegressor and you can tune it like this:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

(trainingData, testData) = data.randomSplit([0.7, 0.3])
dt = DecisionTreeRegressor(featuresCol="features", labelCol="label")

paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [4, 5, 6, 7]) \
    .addGrid(dt.maxBins, [24, 28, 32, 36]) \
    .build()

crossval = CrossValidator(estimator=dt,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(metricName="rmse"),
                          numFolds=3)

# Run cross-validation and choose the best set of parameters.
cvModel = crossval.fit(trainingData)
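You can then score the held-out split with the tuned model, e.g. predictions = cvModel.transform(testData), and evaluate it with the same RegressionEvaluator.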
This is a regression problem. The shape of my training data is (417, 5) and the shape of my test data is (105, 5). I scale both using the following code:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale train
scaler = MinMaxScaler()
train_df = pd.DataFrame(scaler.fit_transform(train_df))

# Scale test (transform only, reusing the statistics fitted on the training set)
test_df = pd.DataFrame(scaler.transform(test_df))
The first four rows of the training data after scaling are shown below, where column '4' is the dependent variable and the rest are independent variables.
After training a deep neural network, I get predictions in scaled form. I try to unscale the predictions, which are stored in y_pred_dnn, using the following code:
scaler.inverse_transform(y_pred_dnn)
But I get the following error:
ValueError: non-broadcastable output operand with shape (105,1) doesn't match the broadcast shape (105,5)
How do I debug the problem?
Thanks
The scaler was fitted on all five columns, so inverse_transform expects an array with five columns, while y_pred_dnn has shape (105, 1); that mismatch is what raises the error. You can solve this by separating out y before scaling. You don't need to scale y for prediction, so try:
y_train, y_test = train_df.iloc[:, 4], test_df.iloc[:, 4]
X_train, X_test = train_df.iloc[:, :4], test_df.iloc[:, :4]
After this you do the scaling on the X part only, and you won't need any inverse scaling.
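If you do want to train on a scaled target, fit a separate scaler on y alone so the inverse transform shapes match. A minimal sketch, assuming y_train and y_pred_dnn from above:
from sklearn.preprocessing import MinMaxScaler

# a separate scaler fitted only on the single target column
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
# ... train the network on y_train_scaled ...
y_pred = y_scaler.inverse_transform(y_pred_dnn)  # both sides are now (n, 1)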
I'm trying to build a simple linear regression model in Spark using Scala. To test the method, I'm trying to perform a single-variable regression using a test data set.
My data set is as follows:
x - integers from 1 to 100
y - random values generated from excel using the formula =RANDBETWEEN(-10,10)*RAND() + x_i
I've run a regression for this data set using the Python sklearn library and it gives me the best-fit line (with r2 = 0.98) for the data, as expected.
However, if I run a regression using Spark, my prediction has a constant value for all the x values in the data set, with an r2 value of 2e-16.
Why doesn't this code give me the best fit line as the prediction? What am I missing?
Here's the code I'm using
Python Code that works
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x = np.array(df['x']).reshape(-1, 1)
y = np.array(df['y']).reshape(-1, 1)

clf = LinearRegression(normalize=True)
clf.fit(x, y)
y_predictions = clf.predict(x)
print(r2_score(y, y_predictions))
Here's a plot from the python regression.
Scala code that gives a constant prediction
val labelCol = "y"
val assembler = new VectorAssembler()
.setInputCols(Array("x"))
.setOutputCol("features")
val df2 = assembler.transform(df)
val labelIndexer = new StringIndexer().setInputCol(labelCol).setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)
val regressor = new LinearRegression()
.setMaxIter(10)
.setRegParam(1.0)
.setElasticNetParam(1.0)
val model = regressor.fit(df3)
val predictions = model.transform(df3)
val modelSummary = model.summary
println(s"r2 = ${modelSummary.r2}")
The issue was the StringIndexer, which should not be used on numeric columns. In my case, instead of using the StringIndexer, I should have just renamed the y column to label. This fixes the problem.
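A minimal sketch of the fix, reusing the df, assembler, and regressor from above:
// keep y numeric and simply rename it to Spark ML's default label column
val df2 = assembler.transform(df).withColumnRenamed("y", "label")
val model = regressor.fit(df2)
Note also that setRegParam(1.0) combined with setElasticNetParam(1.0) applies fairly strong L1 regularization, which can itself flatten the fitted line, so lowering regParam is worth trying as well.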
I am a little new to neural networks and Keras. I have some images of size 6×7, and the filter size is 15. I want to have several filters, train a convolutional layer separately on each, and then combine them. I have looked at one example here:
model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Dropout(0.25))
model.add(Flatten(input_shape=input_shape))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('tanh'))
This model works with one filter. Can anybody give me some hints on how to modify the model to work with parallel convolutional layers?
Thanks
Here is an example of designing a network of parallel convolution and sub-sampling layers in Keras version 2. I hope this resolves your problem.
import keras
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Model
from keras.utils import plot_model

rows, cols = 100, 15
num_classes = 2  # assumption: set this to the number of classes in your problem

def create_convnet(img_path='network_image.png'):
    input_shape = Input(shape=(rows, cols, 1))

    tower_1 = Conv2D(20, (100, 5), padding='same', activation='relu')(input_shape)
    tower_1 = MaxPooling2D((1, 11), strides=(1, 1), padding='same')(tower_1)

    tower_2 = Conv2D(20, (100, 7), padding='same', activation='relu')(input_shape)
    tower_2 = MaxPooling2D((1, 9), strides=(1, 1), padding='same')(tower_2)

    tower_3 = Conv2D(20, (100, 10), padding='same', activation='relu')(input_shape)
    tower_3 = MaxPooling2D((1, 6), strides=(1, 1), padding='same')(tower_3)

    merged = keras.layers.concatenate([tower_1, tower_2, tower_3], axis=1)
    merged = Flatten()(merged)

    out = Dense(200, activation='relu')(merged)
    out = Dense(num_classes, activation='softmax')(out)

    model = Model(input_shape, out)
    plot_model(model, to_file=img_path)
    return model
The image plot_model writes out shows the three convolution towers branching from the input and merging back into a single tensor before the dense layers.
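To train it, compile and fit as usual; a sketch, assuming your data has shape (n, 100, 15, 1) and one-hot labels:
model = create_convnet()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=10)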
My approach is to create another model that defines all the parallel convolution and pooling operations, and concatenates the parallel result tensors into a single output tensor. You can then add this parallel model graph to your sequential model just like a layer. Here is my solution; I hope it solves your problem.
# variable initialization
from keras import Input, Model, Sequential
from keras.layers import Conv2D, MaxPooling2D, Concatenate, Activation, Dropout, Flatten, Dense

nb_filters = 100
kernel_size = {}
kernel_size[0] = [3, 3]
kernel_size[1] = [4, 4]
kernel_size[2] = [5, 5]
input_shape = (32, 32, 3)
pool_size = (2, 2)
nb_classes = 2
no_parallel_filters = 3

# create a separate model graph for parallel processing with different kernel sizes;
# 'same' padding makes every branch produce an output tensor of the same size,
# so all parallel outputs can be concatenated
inp = Input(shape=input_shape)
convs = []
for k_no in range(len(kernel_size)):
    conv = Conv2D(nb_filters, (kernel_size[k_no][0], kernel_size[k_no][1]),
                  padding='same',
                  activation='relu')(inp)
    pool = MaxPooling2D(pool_size=pool_size)(conv)
    convs.append(pool)

if len(kernel_size) > 1:
    out = Concatenate()(convs)
else:
    out = convs[0]

conv_model = Model(inputs=inp, outputs=out)

# add the created model graph to a sequential model
model = Sequential()
model.add(conv_model)  # add the model just like a layer
model.add(Conv2D(nb_filters, (kernel_size[1][0], kernel_size[1][1])))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('tanh'))
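As a quick sanity check you can print both graphs and confirm the three parallel branches merge into one tensor before the sequential tail:
conv_model.summary()
model.summary()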
For more information, see this similar question: Combining the outputs of multiple models into one model
I have a multiclass classification problem. Say I have a feature matrix:
A B C D
1 -1 1 -6
2 0.5 0 11
7 3.7 1 1
4 -50 1 0
And labels:
LABEL
0
1
2
0
2
I want to try to apply convolution kernels along each single feature row with Keras. Say nb_filter=2 and batch_size=3. So I expect the input shape for the convolution layer to be (3, 4) and the output shape to be (3, 3) (as the kernel is applied to AB, BC, CD).
Here is what I tried with Keras (v1.2.1, Theano backend):
def CreateModel(input_dim, num_hidden_layers):
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Convolution1D, Flatten
    model = Sequential()
    model.add(Convolution1D(nb_filter=10, filter_length=1, input_shape=(1, input_dim), activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.summary()
    return model

def OneHotTransformation(y):
    from keras.utils import np_utils
    return np_utils.to_categorical(y)
X_train = X_train.values.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test = X_test.values.reshape(X_test.shape[0], 1, X_test.shape[1])
y_train = OneHotTransformation(y_train)
clf = KerasClassifier(build_fn=CreateModel, input_dim=X_train.shape[1], num_hidden_layers=1, nb_epoch=10, batch_size=500)
clf.fit(X_train, y_train)
Shapes:
print X_train.shape
print X_test.shape
print y_train.shape
Output:
(45561, 44)
(11391, 44)
(45561L,)
When I try to run this code I get an exception:
ValueError: Error when checking model target: expected dense_1 to have 3 dimensions, but got array with shape (45561L, 3L)
I tried to reshape y_train:
y_train = y_train.reshape(y_train.shape[0], 1, y_train.shape[1])
This gives me exception:
ValueError: Error when checking model target: expected dense_1 to have 3 dimensions, but got array with shape (136683L, 2L)
Is this approach with Convolution1D correct for achieving my goal?
If #1 is yes, how can I fix my code?
I've already read numerous GitHub issues and some questions (1, 2) here, but nothing really helped.
Thanks.
UPDATE1: Per Matias Valdenegro's comment, here are the shapes after reshaping 'X' and after one-hot encoding 'y':
print X_train.shape
print X_test.shape
print y_train.shape
Output:
(45561L, 1L, 44L)
(11391L, 1L, 44L)
(45561L, 3L)
UPDATE2: Thanks again to Matias Valdenegro. The X reshaping is done after creating the model; for sure it was a copy-paste issue. The code should look like:
def CreateModel(input_dim, num_hidden_layers):
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Convolution1D, Flatten
    model = Sequential()
    model.add(Convolution1D(nb_filter=10, filter_length=1, input_shape=(1, input_dim), activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.summary()
    return model

def OneHotTransformation(y):
    from keras.utils import np_utils
    return np_utils.to_categorical(y)
clf = KerasClassifier(build_fn=CreateModel, input_dim=X_train.shape[1], num_hidden_layers=1, nb_epoch=10, batch_size=500)
X_train = X_train.values.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test = X_test.values.reshape(X_test.shape[0], 1, X_test.shape[1])
y_train = OneHotTransformation(y_train)
clf.fit(X_train, y_train)
The input to a 1D convolution should have dimensions (num_samples, channels, width). That means you need to reshape X_train and X_test, not y_train:
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
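If the target-shape error persists after reshaping, it is because the Convolution1D output is still 3D, so the following Dense also outputs 3D while the one-hot targets are 2D; a Flatten layer between them collapses the extra axis. A sketch of CreateModel with that one change (Flatten is already imported in your code):
model.add(Convolution1D(nb_filter=10, filter_length=1, input_shape=(1, input_dim), activation='relu'))
model.add(Flatten())  # collapse (steps, filters) so the Dense output is (batch, 3)
model.add(Dense(3, activation='softmax'))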
I want to define a function in Scala to which I can pass my training and test datasets; it should then run a simple machine learning algorithm and return some statistics. How should I do that? What should the parameter data types be?
Imagine you need to define a function that takes training and test datasets, performs a simple classification algorithm, and then returns the accuracy.
What I expect to have is something like the following:
val data = MLUtils.loadLibSVMFile(sc, datadir + "/example.txt");
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L);
val training = splits(0).cache();
val test = splits(1);
val results1 = SVMFunction(training, test)
val results2 = RegressionFunction(training, test)
val results3 = ClassificationFunction(training, test)
I need just the declarations of the functions, not the code that produces results1, results2, and results3.
def SVMFunction ("I need help here"){
//I know how to work with the training and test datasets to generate the results.
//So no need to discuss what should be here
}
Thanks.
In case you're using supervised learning, you should opt for LabeledPoint. An excerpt from the MLlib docs:
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
An example:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
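So the declarations you're after would take RDDs of LabeledPoint, which is exactly what MLUtils.loadLibSVMFile and your randomSplit produce. A sketch (the body is left out, as you requested):
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

def SVMFunction(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Double = {
  // train an SVM on `training`, evaluate it on `test`, and return the accuracy
  ???
}
RegressionFunction and ClassificationFunction can use the same parameter types, changing the return type to whatever statistics they report.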