How to set proper arguments to build a Keras Convolution2D NN model [Text Classification]?

I am trying to use a 2D CNN for text classification of Chinese articles and am having trouble setting the arguments of Keras's Convolution2D. I know the basic flow of Convolution2D for images, but I am stuck applying it to my dataset in Keras.
Input data
My data is 9800 Chinese articles; the maximum sentence length is 6810, with a word2vec size of 200.
So the input shape is `(9800, 1, 6810, 200)`.
Code for building model
MAX_FEATURES = 6810
# I just randomly pick one filter, seems this is the problem?
nb_filter = 128
input_shape = (1, 6810, 200)
# each word is 200 (word2vec size)
embedding_size = 200
# 3 word length
n_gram = 3
# so stride here is embedding_size*n_gram
model = Sequential()
model.add(Convolution2D(nb_filter, n_gram, embedding_size, border_mode='valid', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(100, 1), border_mode='valid'))
model.add(Dropout(0.5))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(hidden_dims))
model.add(Dropout(0.5))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# X is (9800, 1, 6810, 200)
model.fit(X, y, batch_size=32,
          nb_epoch=5,
          validation_split=0.1)
Question 1. I have trouble setting the Convolution2D arguments. My research is below.
The official docs do not contain an example of a 2D CNN for text classification (though they have one for a 1D CNN).
The Convolution2D definition is here: https://keras.io/layers/convolutional/
keras.layers.convolutional.Convolution2D(nb_filter, nb_row, nb_col, init='glorot_uniform', activation=None, weights=None, border_mode='valid', subsample=(1, 1), dim_ordering='default', W_regularizer=None, b_regularizer=None, activity_regularizer=None, W_constraint=None, b_constraint=None, bias=True)
nb_filter: Number of convolution filters to use.
nb_row: Number of rows in the convolution kernel.
nb_col: Number of columns in the convolution kernel.
border_mode: 'valid', 'same' or 'full'. ('full' requires the Theano backend.)
Some research about the arguments:
This issue https://github.com/fchollet/keras/issues/233 is about 2D CNNs for text classification. I read all the comments and picked out:
(1) https://github.com/fchollet/keras/issues/233#issuecomment-117427013
model.add(Convolution2D(nb_filter=N_FILTERS, stack_size=1, nb_row=FIELD_SIZE,
nb_col=1, subsample=(STRIDE, 1)))
(2) https://github.com/fchollet/keras/issues/233#issuecomment-117700913
sequential.add(Convolution2D(nb_feature_maps, 1, n_gram, embedding_size))
But these seem to differ from the current Keras version, and the argument names used by different people are inconsistent (I wish Keras had an easily understandable explanation of the arguments).
Another comment I found about the current API:
https://github.com/fchollet/keras/issues/1665#issuecomment-181181000
The current API is as below:
keras.layers.convolutional.Convolution2D(nb_filter, nb_row, nb_col, init='glorot_uniform', activation='linear', weights=None, border_mode='valid', subsample=(1, 1), dim_ordering='th', W_regularizer=None, b_regularizer=None, activity_regularizer=None, W_constraint=None, b_constraint=None)
So (36,1,7,7) seems the reason, the correct arguments would be (36,7,7,...).
Based on the above research and my understanding of convolution, Convolution2D creates a (nb_filter, nb_row, nb_col) filter, slides it by the stride to get one filter result per position, and finally combines the results into an array with shape (1, one_sample_article_length[6810] / nb_filter) that goes to the next layer. Is that right? Does my code above set nb_row and nb_col correctly?
Question 2. What are the proper MaxPooling2D arguments? (For my dataset or in general, either is OK.)
I referred to this issue https://github.com/fchollet/keras/issues/233#issuecomment-117427013 to set the argument; there are two variants:
MaxPooling2D(poolsize=(((nb_features - FIELD_SIZE) / STRIDE) + 1, 1))
MaxPooling2D(poolsize=(maxlen - n_gram + 1, 1))
I have no idea why they calculate the MaxPooling2D argument like that.
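For reference, both expressions appear to equal the length of the feature map that a 'valid' (unpadded) convolution produces, so the pooling window would cover the whole map and keep one maximum per filter. The sketch below only does that arithmetic with the numbers from this question; `conv_output_length` is a hypothetical helper, not a Keras function.

```python
def conv_output_length(input_length, kernel_size, stride=1):
    """Output length of a 'valid' (no padding) convolution along one axis."""
    return (input_length - kernel_size) // stride + 1


# Values taken from the question above.
maxlen, n_gram, embedding_size = 6810, 3, 200

# The kernel spans n_gram words and the full embedding width.
rows = conv_output_length(maxlen, n_gram)                   # maxlen - n_gram + 1
cols = conv_output_length(embedding_size, embedding_size)   # collapses to 1

# A pool size equal to the feature map size pools over the whole article.
pool_size = (rows, cols)
print(pool_size)
```

With STRIDE = 1 and FIELD_SIZE = n_gram, the first expression from the issue reduces to the second one.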
Question 3. Any recommendations for batch_size and nb_epoch for this kind of text classification? I have no idea at all.

How to deal with overfitting with simple (X,Y) data in MLPRegressor

Solved
Dealing with low amounts of data, and dealing with overfitting w/ Folding[GridSearchCV]
I am completely stumped as to how to get better estimates from my model. When I run my code, I get negative accuracies. How can I improve cross_val_score (or the test score, whatever you want to call it) so that I can predict values more reliably?
I tried adding more data (from 50 to 200+).
I tried random parameters (and realized this was a naive approach).
I also tried scaling my features with StandardScaler.
Does anyone have any suggestions?
from sklearn.neural_network import MLPRegressor
from sklearn import preprocessing
import requests
import json
from calendar import monthrange
import numpy as np
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import scale
r =requests.get('https://www.alphavantage.co/query?function=TIME_SERIES_WEEKLY_ADJUSTED&symbol=W&apikey=QYQ2D6URDOKNUGF4')
#print(r.text)
y = json.loads(r.text)
#print(y["Monthly Adjusted Time Series"].keys())
keysInResultSet = y["Weekly Adjusted Time Series"].keys()
#print(keysInResultSet)
featuresListTemp = []
labelsListTemp = []
count = 0
for i in keysInResultSet:
    #print(i)
    count = count + 1
    #print(y["Monthly Adjusted Time Series"][i])
    tmpList = []
    tmpList.append(count)
    featuresListTemp.append(tmpList)
    strValue = y["Weekly Adjusted Time Series"][i]["5. adjusted close"]
    numValue = float(strValue)
    labelsListTemp.append(numValue)
print("TOTAL SET")
print(featuresListTemp)
print(labelsListTemp)
print("---")
arrTestInput = []
arrTestOutput = []
print("SCALING SET")
X_train = np.array(featuresListTemp)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled)
product_model = MLPRegressor()
#10.0 ** -np.arange(1, 10)
#todo : once found general settings, iterate through some more seeds to find one that can be used on the training
parameters = {'learning_rate': ['constant','adaptive'],'solver': ['lbfgs','adam'], 'tol' : 10.0 ** -np.arange(1, 4), 'verbose' : [True], 'early_stopping': [True], 'activation' : ['tanh','logistic'], 'learning_rate_init': 10.0 ** -np.arange(1, 4), 'max_iter': [4000], 'alpha': 10.0 ** -np.arange(1, 4), 'hidden_layer_sizes':np.arange(1,11), 'random_state':np.arange(1, 3)}
clf = GridSearchCV(product_model, parameters, n_jobs=-1)
clf.fit(X_train_scaled, labelsListTemp)
print(clf.score(X_train_scaled, labelsListTemp))
print(clf.best_params_)
best_params = clf.best_params_
newPM = MLPRegressor(hidden_layer_sizes=((best_params['hidden_layer_sizes'])), #try reducing the layer size / increasing it and playing around with resultFit variable
                     batch_size='auto',
                     power_t=0.5,
                     activation=best_params['activation'],
                     solver=best_params['solver'], #non scaled input
                     learning_rate=best_params['learning_rate'],
                     max_iter=best_params['max_iter'],
                     learning_rate_init=best_params['learning_rate_init'],
                     alpha=best_params['alpha'],
                     random_state=best_params['random_state'],
                     early_stopping=best_params['early_stopping'],
                     tol=best_params['tol'])
scores = cross_val_score(newPM, X_train_scaled, labelsListTemp, cv=10, scoring='neg_mean_absolute_error')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print(scores)
Output from line 63 and down
0.9142644531564619 {'activation': 'logistic', 'alpha': 0.001, 'early_stopping': True, 'hidden_layer_sizes': 7, 'learning_rate':
'constant', 'learning_rate_init': 0.1, 'max_iter': 4000,
'random_state': 2, 'solver': 'lbfgs', 'tol': 0.01, 'verbose': True}
Accuracy: -21.91 (+/- 58.89) [ -32.87854574 -105.0632913
-22.89836453 -7.33154414 -22.38773819 -3.3786339 -1.7658796 -3.78002866 -4.78734308 -14.81212738]
{'activation': 'logistic', 'alpha': 0.01, 'early_stopping': True, 'hidden_layer_sizes': 30, 'learning_rate': 'constant', 'learning_rate_init': 0.1, 'max_iter': 4000, 'random_state': 2, 'solver': 'lbfgs', 'tol': 0.1, 'verbose': True}
{'activation': 'tanh', 'alpha': 0.01, 'early_stopping': True, 'hidden_layer_sizes': 99, 'learning_rate': 'constant', 'learning_rate_init': 0.1, 'max_iter': 4000, 'random_state': 1, 'solver': 'lbfgs', 'tol': 0.01, 'verbose': True}
Both configurations stated above work for the sample set. Thanks all; please let me know if there are any questions. This can be solved by narrowing the ranges of your other parameters to a more limited set, i.e. instead of 10.0 ** -np.arange(1, 3) use 10.0 ** -np.arange(1, 2).
Once you have performed your first grid search, start fixing parameters whose best value you already know. This is very hard to do, but one candidate is learning_rate='constant': I noticed that all my best fits had a constant learning rate, regardless of the other parameters.
This is mostly for time optimization, but it will also help with overfitting as you increase the number of nodes in the network. The idea is to improve the fit by some N degrees without losing too much of the generalization properties of the true function.
You should start your grid search with the number of hidden nodes somewhere between the number of input nodes and the number of output nodes.
Once you find a decent fit, you can improve it by increasing the number of nodes. Take care not to add so many nodes that you lose the generalization power of the true function. Before you even start thinking about scaling up, reduce the complexity of the parameter grid, so that your second grid search runs on an increased number of nodes with more general parameters.
Generalizing the parameters, as described above, means that the second grid search reuses the more general settings found in the initial search while increasing the number of network nodes.
I know this is confusing, but it is what helped me fit this decently.
For anyone struggling, I would try to:
0) generalize after performing a search and getting a decent model
1) use that generalization in a second search with increased nodes
2) play with the alpha parameter while scaling up (the rest of the parameters you can generalize)
3) add a few different seeds or remove them, depending on the situation
4) While changing tol will alter the fit, it is also highly dependent on the number of iterations. For that reason, depending on the case, a reasonable value might be .01 or .001 (reasonable depending on how many iterations you are willing to wait for a given result or opportunity to converge). If tol is set too low, you will run out of iterations, as training will never get a chance to stop early.
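To make the time-optimization point concrete, here is a rough sketch that only counts how many parameter combinations a grid contains. The full grid mirrors the one in the question; the narrowed grid is an assumed example of fixing learning_rate and shrinking the ranges, not the exact grid used above. `grid_size` is a hypothetical helper, and the grid search fits every combination once per cross-validation fold on top of this count.

```python
from math import prod


def grid_size(param_grid):
    """Number of parameter combinations an exhaustive grid search would try."""
    return prod(len(values) for values in param_grid.values())


# Same sizes as the grid in the question (10.0 ** -np.arange(1, 4) has 3 values).
full_grid = {
    'learning_rate': ['constant', 'adaptive'],
    'solver': ['lbfgs', 'adam'],
    'tol': [10.0 ** -k for k in range(1, 4)],
    'activation': ['tanh', 'logistic'],
    'learning_rate_init': [10.0 ** -k for k in range(1, 4)],
    'alpha': [10.0 ** -k for k in range(1, 4)],
    'hidden_layer_sizes': list(range(1, 11)),
    'random_state': [1, 2],
}

# Assumed narrowing: fix learning_rate and shrink each range to a single value.
narrow_grid = dict(full_grid,
                   learning_rate=['constant'],
                   tol=[0.1],
                   learning_rate_init=[0.1],
                   alpha=[0.1])

print(grid_size(full_grid), grid_size(narrow_grid))
```

The saved budget can then be spent on a wider hidden_layer_sizes range in the second search.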

Inputs to Encoder-Decoder LSTMCell/RNN Network

I'm creating an LSTM encoder-decoder network using Keras, following the code provided here: https://github.com/LukeTonin/keras-seq-2-seq-signal-prediction. The only change I made is to replace the GRUCell with an LSTMCell. Both the encoder and the decoder consist of 2 layers of 35 LSTMCells each. The layers are stacked over (and combined with) each other using an RNN layer.
The LSTMCell returns 2 states, whereas the GRUCell returns 1 state. This is where I am encountering an error, as I do not know how to code for the 2 returned states of the LSTMCell.
I have created two models: first, an encoder-decoder model; second, a prediction model. I am not encountering any problems in the encoder-decoder model, but I am encountering problems in the decoder of the prediction model.
The error I am getting is:
ValueError: Layer rnn_4 expects 9 inputs, but it received 3 input tensors. Input received: [<tf.Tensor 'input_4:0' shape=(?, ?, 1) dtype=float32>, <tf.Tensor 'input_11:0' shape=(?, 35) dtype=float32>, <tf.Tensor 'input_12:0' shape=(?, 35) dtype=float32>]
This error happens when this line below, in the prediction model, is run:
decoder_outputs_and_states = decoder(
decoder_inputs, initial_state=decoder_states_inputs)
The section of code this fits into is:
encoder_predict_model = keras.models.Model(encoder_inputs,
                                           encoder_states)
decoder_states_inputs = []
# Read layers backwards to fit the format of initial_state
# For some reason, the states of the model are order backwards (state of the first layer at the end of the list)
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
    # One state for GRU, but two states for LSTMCell
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
decoder_outputs_and_states = decoder(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_outputs = decoder_outputs_and_states[0]
decoder_states = decoder_outputs_and_states[1:]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_predict_model = keras.models.Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
Could somebody help me with the for loop above, and with the initial states I should pass to the decoder after that?
I had a similar error and solved it by doing what the comment says, adding another input tensor:
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
    # Two states for LSTMCell (a GRU needs only one)
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
That solved the problem for me.
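The bookkeeping behind that fix can be sketched without Keras: each stacked cell contributes one state input for a GRU and two for an LSTM. `state_input_shapes` is a hypothetical helper that only lists the shapes the Input tensors would need, assuming the two layers of 35 units described in the question.

```python
def state_input_shapes(layers, states_per_cell):
    """Shapes of the state inputs, reading layers backwards as in the loop above."""
    shapes = []
    for hidden_neurons in layers[::-1]:
        # One entry per state tensor that the cell carries between steps.
        shapes.extend([(hidden_neurons,)] * states_per_cell)
    return shapes


layers = [35, 35]  # two stacked layers of 35 units, as in the question

gru_shapes = state_input_shapes(layers, states_per_cell=1)
lstm_shapes = state_input_shapes(layers, states_per_cell=2)

print(len(gru_shapes), len(lstm_shapes))
```

Each shape would be passed to one keras.layers.Input call, so the LSTM version needs twice as many state inputs as the GRU version.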

What's the purpose of nb_epoch in Keras's fit_generator?

It seems like I could get the exact same result by making num_samples bigger and keeping nb_epoch=1. I thought the purpose of multiple epochs was to iterate over the same data multiple times, but Keras doesn't reinstantiate the generator at the end of each epoch. It just keeps going. For example training this autoencoder:
import numpy as np
from keras.layers import (Convolution2D, MaxPooling2D,
UpSampling2D, Activation)
from keras.models import Sequential
rand_imgs = [np.random.rand(1, 100, 100, 3) for _ in range(1000)]
def keras_generator():
    i = 0
    while True:
        print(i)
        rand_img = rand_imgs[i]
        i += 1
        yield (rand_img, rand_img)
layers = ([
    Convolution2D(20, 5, 5, border_mode='same',
                  input_shape=(100, 100, 3), activation='relu'),
    MaxPooling2D((2, 2), border_mode='same'),
    Convolution2D(3, 5, 5, border_mode='same', activation='relu'),
    UpSampling2D((2, 2)),
    Convolution2D(3, 5, 5, border_mode='same', activation='relu')])
autoencoder = Sequential()
for layer in layers:
    autoencoder.add(layer)
gen = keras_generator()
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
history = autoencoder.fit_generator(gen, samples_per_epoch=100, nb_epoch=2)
It seems like I get the same result with (samples_per_epoch=100, nb_epoch=2) as I do with (samples_per_epoch=200, nb_epoch=1). Am I using fit_generator as intended?
Yes, you are right that when using fit_generator these two approaches are equivalent. But there are a variety of reasons why keeping epochs is reasonable:
Logging: in this case an epoch comprises the amount of data after which you want to log some important statistics about training (e.g. time or loss at the end of the epoch).
Keeping the directory structure when you are using a generator to load data from your hard disk: when you know how many files you have in your directory, you can adjust batch_size and nb_epoch so that an epoch comprises going through every example in your dataset.
Keeping the structure of the data when using a flow generator: when you have e.g. a set of pictures loaded into Python and you want to use Keras's ImageDataGenerator to apply different kinds of data transformations, setting batch_size and nb_epoch so that an epoch comprises going through every example in your dataset might help you keep track of the progress of your training process.
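The equivalence can be illustrated without Keras. The sketch below uses a hypothetical `run_epochs` function as a stand-in for how a training loop draws from a generator that is never reset between epochs; it is not the fit_generator implementation.

```python
from itertools import islice


def sample_generator():
    """Endless generator that yields sample indices in order."""
    i = 0
    while True:
        yield i
        i += 1


def run_epochs(gen, samples_per_epoch, nb_epoch):
    """Draw samples_per_epoch items per epoch without restarting the generator."""
    seen = []
    for _ in range(nb_epoch):
        seen.extend(islice(gen, samples_per_epoch))
    return seen


two_short = run_epochs(sample_generator(), samples_per_epoch=100, nb_epoch=2)
one_long = run_epochs(sample_generator(), samples_per_epoch=200, nb_epoch=1)

# Both schedules consume exactly the same 200 samples in the same order.
print(two_short == one_long)
```

The second epoch starts at sample 100 rather than 0, which is why the epoch boundary only affects logging and callbacks here.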

Deconv implementation in keras output_shape issue

I am implementing the following colorization model written in Caffe. I am confused about the output_shape parameter to supply in Keras:
model.add(Deconvolution2D(256,4,4,border_mode='same',
output_shape=(None,3,14,14),subsample=(2,2),dim_ordering='th',name='deconv_8.1'))
I have added a dummy output_shape parameter. But how can I determine the correct output_shape? In the Caffe model the layer is defined as:
layer {
  name: "conv8_1"
  type: "Deconvolution"
  bottom: "conv7_3norm"
  top: "conv8_1"
  convolution_param {
    num_output: 256
    kernel_size: 4
    pad: 1
    dilation: 1
    stride: 2
  }
}
If I do not supply this parameter, the code gives a parameter error, but I cannot work out what I should supply as output_shape.
P.S. I already asked on the Data Science forum with no response, maybe due to its small user base.
What output shape does the Caffe deconvolution layer produce?
For this colorization model in particular you can simply refer to page 24 of their paper (which is linked from their GitHub page).
So basically the output shape of this deconvolution layer in the original model is [None, 56, 56, 128]. This is what you want to pass to Keras as output_shape. The only problem, as I mention in the section below, is that Keras doesn't really use this parameter to determine the output shape, so you need to run a dummy prediction to find what your other parameters need to be in order to get what you want.
More generally the Caffe source code for computing its Deconvolution layer output shape is:
const int kernel_extent = dilation_data[i] * (kernel_shape_data[i] - 1) + 1;
const int output_dim = stride_data[i] * (input_dim - 1)
+ kernel_extent - 2 * pad_data[i];
Which with a dilation argument equal to 1 reduces to just:
const int output_dim = stride_data[i] * (input_dim - 1)
+ kernel_shape_data[i] - 2 * pad_data[i];
Note that this matches the Keras documentation when the parameter a is zero:
Formula for calculation of the output shape [3], [4]: o = s (i - 1) + a + k - 2p
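As a quick check, the Caffe formula above can be evaluated with the parameters of the conv8_1 layer from the question (kernel_size 4, stride 2, pad 1, dilation 1). The 28-pixel input size is an assumption chosen for illustration, and `deconv_output_dim` is a hypothetical helper that mirrors the quoted source lines.

```python
def deconv_output_dim(input_dim, kernel_size, stride, pad, dilation=1):
    """Output size of a Caffe Deconvolution layer along one spatial axis."""
    kernel_extent = dilation * (kernel_size - 1) + 1
    return stride * (input_dim - 1) + kernel_extent - 2 * pad


# Layer parameters from the prototxt in the question; input size is assumed.
out = deconv_output_dim(input_dim=28, kernel_size=4, stride=2, pad=1)
print(out)
```

With these settings the formula simplifies to twice the input size, so the layer doubles the spatial resolution.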
How to verify actual output shape with your Keras backend
This is tricky, because the actual output shape depends on the backend implementation and configuration. Keras is currently unable to find it on its own. So you actually have to execute a prediction on some dummy input to find the actual output shape. Here's an example of how to do this from the Keras docs for Deconvolution2D:
To pass the correct `output_shape` to this layer,
one could use a test model to predict and observe the actual output shape.
# Examples
```python
# apply a 3x3 transposed convolution with stride 1x1 and 3 output filters on a 12x12 image:
model = Sequential()
model.add(Deconvolution2D(3, 3, 3, output_shape=(None, 3, 14, 14), border_mode='valid', input_shape=(3, 12, 12)))
# Note that you will have to change the output_shape depending on the backend used.
# we can predict with the model and print the shape of the array.
dummy_input = np.ones((32, 3, 12, 12))
# For TensorFlow dummy_input = np.ones((32, 12, 12, 3))
preds = model.predict(dummy_input)
print(preds.shape)
# Theano GPU: (None, 3, 13, 13)
# Theano CPU: (None, 3, 14, 14)
# TensorFlow: (None, 14, 14, 3)
```
Reference: https://github.com/fchollet/keras/blob/master/keras/layers/convolutional.py#L507
Also, you might be curious why the output_shape parameter apparently doesn't really define the output shape. According to the post "Deconvolution2D layer in keras", this is why:
Back to Keras and how the above is implemented. Confusingly, the output_shape parameter is actually not used for determining the output shape of the layer, and instead they try to deduce it from the input, the kernel size and the stride, while assuming only valid output_shapes are supplied (though it's not checked in the code to be the case). The output_shape itself is only used as input to the backprop step. Thus, you must also specify the stride parameter (subsample in Keras) in order to get the desired result (which could've been determined by Keras from the given input shape, output shape and kernel size).

Keras deep autoencoder prediction is inaccurate

I am using a Keras deep autoencoder to reproduce my sparse matrix of dimension [360, 6860]. Each row is the count of trigrams for a protein sequence. The matrix has 2 classes of proteins, but I want the network to be ignorant of that initially, which is why I am using an autoencoder. I am following the Keras blog autoencoder tutorial for this.
This is my code:
# this is the size of our encoded representations
encoding_dim = 32
input_img = Input(shape=(6860,))
encoded = Dense(128, activation='relu', activity_regularizer=regularizers.activity_l1(10e-5))(input_img)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)
decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(6860, activation='sigmoid')(decoded)
autoencoder = Model(input=input_img, output=decoded)
# this model maps an input to its encoded representation
encoder = Model(input=input_img, output=encoded)
# create a placeholder for an encoded (32-dimensional) input
encoded_input_1 = Input(shape=(32,))
encoded_input_2 = Input(shape=(64,))
encoded_input_3 = Input(shape=(128,))
# retrieve the last layer of the autoencoder model
decoder_layer_1 = autoencoder.layers[-3]
decoder_layer_2 = autoencoder.layers[-2]
decoder_layer_3 = autoencoder.layers[-1]
# create the decoder model
decoder_1 = Model(input = encoded_input_1, output = decoder_layer_1(encoded_input_1))
decoder_2 = Model(input = encoded_input_2, output = decoder_layer_2(encoded_input_2))
decoder_3 = Model(input = encoded_input_3, output = decoder_layer_3(encoded_input_3))
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train,
                nb_epoch=100,
                batch_size=40,
                shuffle=True,
                validation_data=(x_test, x_test))
My validation set has dimension [80, 6860]. The problem is that if I use the decoder to predict from the test set, my predictions are really off. For example, if I predict with the following code:
# encode and decode some digits
# note that we take them from the *test* set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder_1.predict(encoded_imgs)
decoded_imgs = decoder_2.predict(decoded_imgs)
decoded_imgs = decoder_3.predict(decoded_imgs)
print(x_test[3, np.where(x_test[3, :] != 0)[0]])
print(decoded_imgs[3, np.where(x_test[3, :] != 0)[0]])
the nonzero values of a single row of my test set are:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
For the same row, the autoencoder's predictions at the same indices are:
[ 0.04615583 0.04613763 0.10268984 0.00286385 0.0030572 0.02551027
0.00552908 0.09686473 0.02554915 0.0082816 0.02254158 0.01127195
0.00305908 0.17113154 0.01140419 0.03370495 0.00515486 0.02614204
0.00558715 0.02835727 0.0029659 0.01425297 0.00834536 0.04502939
0.02260707 0.01131396 0.00561662 0.01131314 0.00493734 0.00265232
0.0056083 0.01724379 0.06099484 0.03738695 0.01128869 0.01995548
0.00562622 0.00556281 0.01732991 0.03142899 0.05339266 0.04778111
0.00292415 0.02264618 0.01419865 0.00550648 0.00836777 0.01139715]
At first I thought I could use some kind of thresholding to recover the 1's from these values, but they seem pretty random. For a single row, for the first 50 zero values in my test set, my autoencoder predicts:
[ 0.14251608 0.00118295 0.00118732 0.00304095 0.031255 0.00108441
0.0201351 0.00853934 0.00558488 0.00281343 0.00296877 0.00109651
0.01129742 0.00827519 0.0170884 0.01417614 0.01714166 0.00549215
0.00099755 0.00558552 0.00829634 0.01988331 0.00092845 0.00294271
0.01429107 0.01137067 0.01137967 0.01121876 0.00491931 0.00562285
0.0055124 0.01720702 0.0142925 0.00553411 0.00551252 0.00281541
0.01145663 0.002876 0.00555185 0.00525392 0.01421779 0.00273949
0.01698892 0.02529835 0.0112521 0.01130333 0.00554186 0.00291986
0.00554437 0.01144382]
How can I improve the predictions? What am I doing wrong here? I should say that the data is hugely sparse. If you want, you can download the toy data from here. Please let me know if you have any questions.
One of the most important reasons is probably that your training set is just too small. You have a fully connected network with 7 layers (including input and output), so the number of parameters is huge, close to 1.8M, while you only have 360 training samples. So the parameters are essentially untrained.
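The parameter count can be checked by hand from the layer sizes in the question. A minimal sketch, assuming every Dense layer has one weight per input-output pair plus one bias per output unit:

```python
# Layer widths from the question's autoencoder, input through output.
layer_sizes = [6860, 128, 64, 32, 64, 128, 6860]


def dense_param_count(sizes):
    """Total weights and biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))


total = dense_param_count(layer_sizes)
print(total)             # total trainable parameters
print(total / 360.0)     # parameters per training sample
```

Almost all of the parameters sit in the first and last layers, which connect the 6860-dimensional input and output to 128 hidden units.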
You can improve your results in two ways. One is of course to get more training data. The second is to follow the CNN example in the later part of the tutorial. CNNs are popular in part because they can greatly reduce the number of parameters.