Scipy sparse CSR matrix to TensorFlow SparseTensor - Mini-Batch gradient descent - scipy

I have a Scipy sparse CSR matrix created from sparse TF-IDF feature matrix in SVM-Light format. The number of features is huge and it is sparse so I have to use a SparseTensor or else it is too slow.
For example, number of features is 5, and a sample file can look like this:
0 4:1
1 1:3 3:4
0 5:1
0 2:1
After parsing, the training set looks like this:
trainX = <scipy CSR matrix>
trainY = np.array( [0,1,00] )
I have two important questions:
1) How I do convert this to a SparseTensor (sp_ids, sp_weights) efficiently so that I perform fast multiplication (W.X) using lookup: https://www.tensorflow.org/versions/master/api_docs/python/nn.html#embedding_lookup_sparse
2) How do I randomize the dataset at each epoch and recalculate sp_ids, sp_weights to so that I can feed (feed_dict) for the mini-batch gradient descent.
Example code on a simple model like logistic regression will be very appreciated. The graph will be like this:
# GRAPH
mul = tf.nn.embedding_lookup_sparse(W, X_sp_ids, X_sp_weights, combiner = "sum") # W.X
z = tf.add(mul, b) # W.X + b
cost_op = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(z, y_true)) # this already has built in sigmoid apply
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost_op) # construct optimizer
predict_op = tf.nn.sigmoid(z) # sig(W.X + b)

I can answer the first part of your question.
def convert_sparse_matrix_to_sparse_tensor(X):
coo = X.tocoo()
indices = np.mat([coo.row, coo.col]).transpose()
return tf.SparseTensor(indices, coo.data, coo.shape)
First you convert the matrix to COO format. Then you extract the indices, values, and shape and pass those directly to the SparseTensor constructor.

Related

Fitting a neural network with ReLUs to polynomial functions

Out of curiosity I am trying to fit neural network with rectified linear units to polynomial functions.
For example, I would like to see how easy (or difficult) it is for a neural network to come up with an approximation for the function f(x) = x^2 + x. The following code should be able to do it, but seems to not learn anything. When I run
using Base.Iterators: repeated
ENV["JULIA_CUDA_SILENT"] = true
using Flux
using Flux: throttle
using Random
f(x) = x^2 + x
x_train = shuffle(1:1000)
y_train = f.(x_train)
x_train = hcat(x_train...)
m = Chain(
Dense(1, 45, relu),
Dense(45, 45, relu),
Dense(45, 1),
softmax
)
function loss(x, y)
Flux.mse(m(x), y)
end
evalcb = () -> #show(loss(x_train, y_train))
opt = ADAM()
#show loss(x_train, y_train)
dataset = repeated((x_train, y_train), 50)
Flux.train!(loss, params(m), dataset, opt, cb = throttle(evalcb, 10))
println("Training finished")
#show m([20])
it returns
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
Training finished
m([20]) = Float32[1.0]
Anyone here sees how I could make the network fit f(x) = x^2 + x?
There seem to be couple of things wrong with your trial that have mostly to do with how you use your optimizer and treat your input -- nothing wrong with Julia or Flux. Provided solution does learn, but is by no means optimal.
It makes no sense to have softmax output activation on a regression problem. Softmax is used in classification problems where the output(s) of your model represent probabilities and therefore should be on the interval (0,1). It is clear your polynomial has values outside this interval. It is usual to have linear output activation in regression problems like these. This means in Flux no output activation should be defined on the output layer.
The shape of your data matters. train! computes gradients for loss(d...) where d is a batch in your data. In your case a minibatch consists of 1000 samples, and this same batch is repeated 50 times. Neural nets are often trained with smaller batches sizes, but a larger sample set. In the code I provided all batches consist of different data.
For training neural nets, in general, it is advised to normalize your input. Your input takes values from 1 to 1000. My example applies a simple linear transformation to get the input data in the right range.
Normalization can also apply to the output. If the outputs are large, this can result in (too) large gradients and weight updates. Another approach is to lower the learning rate a lot.
using Flux
using Flux: #epochs
using Random
normalize(x) = x/1000
function generate_data(n)
f(x) = x^2 + x
xs = reduce(hcat, rand(n)*1000)
ys = f.(xs)
(normalize(xs), normalize(ys))
end
batch_size = 32
num_batches = 10000
data_train = Iterators.repeated(generate_data(batch_size), num_batches)
data_test = generate_data(100)
model = Chain(Dense(1,40, relu), Dense(40,40, relu), Dense(40, 1))
loss(x,y) = Flux.mse(model(x), y)
opt = ADAM()
ps = Flux.params(model)
Flux.train!(loss, ps, data_train, opt , cb = () -> #show loss(data_test...))

Keras Custom loss function to pass argument (numpy array containing Noise learnt) with same batch size as of y_true and y_pred

I have Implemented a custom loss function which takes in additional Noise ( numpy array) As illustrated below :
def custom_rcae_loss(self):
N = self.Noise
lambda_val = self.lamda[0]
mue = self.mue
self.batchNo += 1
index = self.batchNo
def custom_rcae(y_true, y_pred):
if(N.ndim >1):
term1 = keras.losses.mean_squared_error(y_true, (y_pred + N ))
The issue is that y_pred is of shape (batch_size, 28,28,1) :
How can I make sure my Noise is also of the same shape of y_pred?
Since I would like to perform (y_pred + Noise).
For instance: If my input is 5983 number of samples with a batch size of 128 There is not the same number of batch_size splits.
How can we address this issue while using keras for making sure Noise is of the same shape of y_pred
Looking forward to suggestions and hints
Thanks in advance

Merging two tensors by convolution in Keras

I'm trying to convolve two 1D tensors in Keras.
I get two inputs from other models:
x - of length 100
ker - of length 5
I would like to get the 1D convolution of x using the kernel ker.
I wrote a Lambda layer to do it:
import tensorflow as tf
def convolve1d(x):
y = tf.nn.conv1d(value=x[0], filters=x[1], padding='VALID', stride=1)
return y
x = Input(shape=(100,))
ker = Input(shape=(5,))
y = Lambda(convolve1d)([x,ker])
model = Model([x,ker], [y])
I get the following error:
ValueError: Shape must be rank 4 but is rank 3 for 'lambda_67/conv1d/Conv2D' (op: 'Conv2D') with input shapes: [?,1,100], [1,?,5].
Can anyone help me understand how to fix it?
It was much harder than I expected because Keras and Tensorflow don't expect any batch dimension in the convolution kernel so I had to write the loop over the batch dimension myself, which requires to specify batch_shape instead of just shape in the Input layer. Here it is :
import numpy as np
import tensorflow as tf
import keras
from keras import backend as K
from keras import Input, Model
from keras.layers import Lambda
def convolve1d(x):
input, kernel = x
output_list = []
if K.image_data_format() == 'channels_last':
kernel = K.expand_dims(kernel, axis=-2)
else:
kernel = K.expand_dims(kernel, axis=0)
for i in range(batch_size): # Loop over batch dimension
output_temp = tf.nn.conv1d(value=input[i:i+1, :, :],
filters=kernel[i, :, :],
padding='VALID',
stride=1)
output_list.append(output_temp)
print(K.int_shape(output_temp))
return K.concatenate(output_list, axis=0)
batch_input_shape = (1, 100, 1)
batch_kernel_shape = (1, 5, 1)
x = Input(batch_shape=batch_input_shape)
ker = Input(batch_shape=batch_kernel_shape)
y = Lambda(convolve1d)([x,ker])
model = Model([x, ker], [y])
a = np.ones(batch_input_shape)
b = np.ones(batch_kernel_shape)
c = model.predict([a, b])
In the current state :
It doesn't work for inputs (x) with multiple channels.
If you provide several filters, you get as many outputs, each being the convolution of the input with the corresponding kernel.
From given code it is difficult to point out what you mean when you say
is it possible
But if what you mean is to merge two layers and feed merged layer to convulation, yes it is possible.
x = Input(shape=(100,))
ker = Input(shape=(5,))
merged = keras.layers.concatenate([x,ker], axis=-1)
y = K.conv1d(merged, 'same')
model = Model([x,ker], y)
EDIT:
#user2179331 thanks for clarifying your intention. Now you are using Lambda Class incorrectly, that is why the error message is showing.
But what you are trying to do can be achieved using keras.backend layers.
Though be noted that when using lower level layers you will lose some higher level abstraction. E.g when using keras.backend.conv1d you need to have input shape of (BATCH_SIZE,width, channels) and kernel with shape of (kernel_size,input_channels,output_channels). So in your case let as assume the x has channels of 1(input channels ==1) and y also have the same number of channels(output channels == 1).
So your code now can be refactored as follows
from keras import backend as K
def convolve1d(x,kernel):
y = K.conv1d(x,kernel, padding='valid', strides=1,data_format="channels_last")
return y
input_channels = 1
output_channels = 1
kernel_width = 5
input_width = 100
ker = K.variable(K.random_uniform([kernel_width,input_channels,output_channels]),K.floatx())
x = Input(shape=(input_width,input_channels)
y = convolve1d(x,ker)
I guess I have understood what you mean. Given the wrong example code below:
input_signal = Input(shape=(L), name='input_signal')
input_h = Input(shape=(N), name='input_h')
faded= Lambda(lambda x: tf.nn.conv1d(input, x))(input_h)
You want to convolute each signal vector with different fading coefficients vector.
The 'conv' operation in TensorFlow, etc. tf.nn.conv1d, only support a fixed value kernel. Therefore, the code above can not run as you want.
I have no idea, too. The code you given can run normally, however, it is too complex and not efficient. In my idea, another feasible but also inefficient way is to multiply with the Toeplitz matrix whose row vector is the shifted fading coefficients vector. When the signal vector is too long, the matrix will be extremely large.

Bicoin price prediction using spark and scala [duplicate]

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
split = map(sanitize, line.split(','))
rev = split.pop(-2)
return LabeledPoint(rev, split)
def sanitize(value):
return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This seems to not work for your use case. I'd recommend you to try different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, numIterartions = 10, stepSize = 0.001)

basic help using hmm to clasify a sequence

I am very new to matlab, hidden markov model and machine learning, and am trying to classify a given sequence of signals. Please let me know if the approach I have followed is correct:
create a N by N transition matrix and fill with random values which sum to 1for each row. (N will be the number of states)
create a N by M emission/observation matrix and fill with random values which sum to 1 for each row
convert different instances of the sequence (i.e each instance will be saying the word 'hello' ) into one long stream and feed each stream to the hmm train function such that:
new_transition_matrix old_transition_matrix = hmmtrain(sequence,old_transition_matrix,old_emission_matrix)
give the final transition and emission matrix to hmm decode with an unknown sequence to give the probability
i.e [posterior_states logrithmic_probability] = hmmdecode( sequence, final_transition_matrix,final_emission_matris)
1. and 2. are correct. You have to be careful that your initial transition and emission matrices are not completely uniform, they should be slightly randomized for the training to work.
3. I would just feed in the 'Hello' sequences separately rather than concatenating them to form a single long sequence.
Let's say this is the sequence for Hello: [1,0,1,1,0,0]. If you form one long sequence from 3 'Hello' sequences, you would get:
data = [1,0,1,1,0,0,1,0,1,1,0,0,1,0,1,1,0,0]
This is not ideal, instead you should feed the sequences in separately like:
data = [1,0,1,1,0,0; 1,0,1,1,0,0; 1,0,1,1,0,0].
Since you are using MatLab, I would recommend using the HMM toolbox by Murphy. It has a demo on how you can train an HMM with multiple observation sequences:
M = 3;
N = 2;
% "true" parameters
prior0 = normalise(rand(N ,1));
transmat0 = mk_stochastic(rand(N ,N ));
obsmat0 = mk_stochastic(rand(N ,M));
% training data: a 5*6 matrix, e.g. 5 different 'Hello' sequences of length 6
number_of_seq = 5;
seq_len= 6;
data = dhmm_sample(prior0, transmat0, obsmat0, number_of_seq, seq_len);
% initial guess of parameters
prior1 = normalise(rand(N ,1));
transmat1 = mk_stochastic(rand(N ,N ));
obsmat1 = mk_stochastic(rand(N ,M));
% improve guess of parameters using EM
[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);
LL
4. What you say is correct, below is how you calculate the log probaility in the HMM toolbox:
% use model to compute log[P(Obs|model)]
loglik = dhmm_logprob(data, prior2, transmat2, obsmat2)
Finally: Have a look at this paper by Rabiner on how the mathematics work if anything is unclear.
Hope this helps.