Problem understanding Loss function behavior using Flux.jl. in Julia - neural-network

So. First of all, I am new to Neural Network (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that creates some data set made of
a collection of input vectors (each with 63 elements) and its corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # ensamble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say)Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector is normalized to unity (represents x,y,z coordinates), and the following 60 numbers are also normalized to unity and corresponds to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly, however, it comes to my attention that as I train my model with an increasing data set(Nₜᵣ = 10,15,25, etc...), the loss function seams to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening?. I can not see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tryed changing the activ. functions
I have not tryed changing the optimization method
Considerations: I need a training data set of near 10000 input vectors, and so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly?. Say, If every single data vector is made of 63 numbers, is it correctly to group them in an array? and then pile them into an ´´´Array{Array{Float64,1},1}´´´?. I have no experience using NN and flux. How can I made a data set of 10000 I/O vectors differently? Can this be the issue?. (I am very inclined to this)
Can this behavior be related to the chosen act. functions? (I am not inclined to this)
Can this behavior be related to the opt. algorithm? (I am not inclined to this)
Am I training my model wrong?. Is the iteration loop really iterations or are they epochs. I am struggling to put(differentiate) this concept of "epochs" and "iterations" into practice.

loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function by doing something like using the mean squared cost.

Related

MSELoss when mask is used

I'm trying to calculate MSELoss when mask is used. Suppose that I have tensor with batch_size of 2: [2, 33, 1] as my target, and another input tensor with the same shape. Since sequence length might differ for each instance, I have also a binary mask indicating the existence of each element in the input sequence. So here is what I'm doing:
mse_loss = nn.MSELoss(reduction='none')
loss = mse_loss(input, target)
loss = (loss * mask.float()).sum() # gives \sigma_euclidean over unmasked elements
mse_loss_val = loss / loss.numel()
# now doing backpropagation
mse_loss_val.backward()
Is loss / loss.numel() a good practice? I'm skeptical, as I have to use reduction='none' and when calculating final loss value, I think I should calculate the loss only considering those loss elements that are nonzero (i.e., unmasked), however, I'm taking the average over all tensor elements with torch.numel(). I'm actually trying to take 1/n factor of MSELoss into account. Any thoughts?
There are some issues in the code. I think correct code should be:
mse_loss = nn.MSELoss(reduction='none')
loss = mse_loss(input, target)
loss = (loss * mask.float()).sum() # gives \sigma_euclidean over unmasked elements
non_zero_elements = mask.sum()
mse_loss_val = loss / non_zero_elements
# now doing backpropagation
mse_loss_val.backward()
This is only slightly worse than using .mean() if you are worried about numerical errors.

Bicoin price prediction using spark and scala [duplicate]

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
split = map(sanitize, line.split(','))
rev = split.pop(-2)
return LabeledPoint(rev, split)
def sanitize(value):
return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This seems to not work for your use case. I'd recommend you to try different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, numIterartions = 10, stepSize = 0.001)

How do I actually execute a saved TensorFlow model?

Tensorflow newbie here. I'm trying to build an RNN. My input data is a set of vector instances of size instance_size representing the (x,y) positions of a set of particles at each time step. (Since the instances already have semantic content, they do not require an embedding.) The goal is to learn to predict the positions of the particles at the next step.
Following the RNN tutorial and slightly adapting the included RNN code, I create a model more or less like this (omitting some details):
inputs, self._input_data = tf.placeholder(tf.float32, [batch_size, num_steps, instance_size])
self._targets = tf.placeholder(tf.float32, [batch_size, num_steps, instance_size])
with tf.variable_scope("lstm_cell", reuse=True):
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=0.0)
if is_training and config.keep_prob < 1:
lstm_cell = tf.nn.rnn_cell.DropoutWrapper(
lstm_cell, output_keep_prob=config.keep_prob)
cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * config.num_layers)
self._initial_state = cell.zero_state(batch_size, tf.float32)
from tensorflow.models.rnn import rnn
inputs = [tf.squeeze(input_, [1])
for input_ in tf.split(1, num_steps, inputs)]
outputs, state = rnn.rnn(cell, inputs, initial_state=self._initial_state)
output = tf.reshape(tf.concat(1, outputs), [-1, hidden_size])
softmax_w = tf.get_variable("softmax_w", [hidden_size, instance_size])
softmax_b = tf.get_variable("softmax_b", [instance_size])
logits = tf.matmul(output, softmax_w) + softmax_b
loss = position_squared_error_loss(
tf.reshape(logits, [-1]),
tf.reshape(self._targets, [-1]),
)
self._cost = cost = tf.reduce_sum(loss) / batch_size
self._final_state = state
Then I create a saver = tf.train.Saver(), iterate over the data to train it using the given run_epoch() method, and write out the parameters with saver.save(). So far, so good.
But how do I actually use the trained model? The tutorial stops at this point. From the docs on tf.train.Saver.restore(), in order to read back in the variables, I need to either set up exactly the same graph I was running when I saved the variables out, or selectively restore particular variables. Either way, that means my new model will require inputs of size batch_size x num_steps x instance_size. However, all I want now is to do a single forward pass through the model on an input of size num_steps x instance_size and read out a single instance_size-sized result (the prediction for the next time step); in other words, I want to create a model that accepts a different-size tensor than the one I trained on. I can kludge it by passing the existing model my intended data batch_size times, but that doesn't seem like a best practice. What's the best way to do this?
You have to create a new graph that has the same structure but with the batch_size = 1 and import the saved variables with tf.train.Saver.restore(). You can take a look at how they define multiple models with variable batch size in ptb_word_lm.py: https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/rnn/ptb/ptb_word_lm.py
So you can have a separate file for instance, where you instantiate the graph with the batch_size that you want, then restore the saved variables. Then you can execute your graph.

How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) best work for the kind of data I'm using (news articles).
I have some testing data which consists of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV but don't understand how (or if it can) be applied in this case, since it needs the test data to be split, but I want to run the evaluation on the entire dataset and compare the results to the pre-labeled data.
I have been trying to specify a scoring function which compares the estimator's labels to the true labels, but of course it doesn't work because only a sample of the data has been clustered, not all of it.
What's an appropriate approach here?
The following function for DBSCAN might help. I've written it to iterate over the hyperparameters eps and min_samples and included optional arguments for min and max clusters. As DBSCAN is unsupervised, I have not included an evaluation parameter.
def dbscan_grid_search(X_data, lst, clst_count, eps_space = 0.5,
min_samples_space = 5, min_clust = 0, max_clust = 10):
"""
Performs a hyperparameter grid search for DBSCAN.
Parameters:
* X_data = data used to fit the DBSCAN instance
* lst = a list to store the results of the grid search
* clst_count = a list to store the number of non-whitespace clusters
* eps_space = the range values for the eps parameter
* min_samples_space = the range values for the min_samples parameter
* min_clust = the minimum number of clusters required after each search iteration in order for a result to be appended to the lst
* max_clust = the maximum number of clusters required after each search iteration in order for a result to be appended to the lst
Example:
# Loading Libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Loading iris dataset
iris = datasets.load_iris()
X = iris.data[:, :]
y = iris.target
# Scaling X data
dbscan_scaler = StandardScaler()
dbscan_scaler.fit(X)
dbscan_X_scaled = dbscan_scaler.transform(X)
# Setting empty lists in global environment
dbscan_clusters = []
cluster_count = []
# Inputting function parameters
dbscan_grid_search(X_data = dbscan_X_scaled,
lst = dbscan_clusters,
clst_count = cluster_count
eps_space = pd.np.arange(0.1, 5, 0.1),
min_samples_space = pd.np.arange(1, 50, 1),
min_clust = 3,
max_clust = 6)
"""
# Importing counter to count the amount of data in each cluster
from collections import Counter
# Starting a tally of total iterations
n_iterations = 0
# Looping over each combination of hyperparameters
for eps_val in eps_space:
for samples_val in min_samples_space:
dbscan_grid = DBSCAN(eps = eps_val,
min_samples = samples_val)
# fit_transform
clusters = dbscan_grid.fit_predict(X = X_data)
# Counting the amount of data in each cluster
cluster_count = Counter(clusters)
# Saving the number of clusters
n_clusters = sum(abs(pd.np.unique(clusters))) - 1
# Increasing the iteration tally with each run of the loop
n_iterations += 1
# Appending the lst each time n_clusters criteria is reached
if n_clusters >= min_clust and n_clusters <= max_clust:
dbscan_clusters.append([eps_val,
samples_val,
n_clusters])
clst_count.append(cluster_count)
# Printing grid search summary information
print(f"""Search Complete. \nYour list is now of length {len(lst)}. """)
print(f"""Hyperparameter combinations checked: {n_iterations}. \n""")
Have you considered implementing the search yourself?
It's not particularly hard to implement a for loop. Even if you want to optimize two parameters it's still fairly easy.
For both DBSCAN and MeanShift I do however advise to first understand your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure instead of parameter optimization to match some labels (which has a high risk of overfitting).
In other words, at which distance are two articles supposed to be clustered?
If this distance varies too much from one data point to another, these algorithms will fail badly; and you may need to find a normalized distance function such that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context. They may work much worse in a clustering context.
Also beware that MeanShift (similar to k-means) needs to recompute coordinates - on text data, this may yield undesired results; where the updated coordinates actually got worse, instead of better.

A moving average with different functions and varying time-frames

I have a matrix time-series data for 8 variables with about 2500 points (~10 years of mon-fri) and would like to calculate the mean, variance, skewness and kurtosis on a 'moving average' basis.
Lets say frames = [100 252 504 756] - I would like calculate the four functions above on over each of the (time-)frames, on a daily basis - so the return for day 300 in the case with 100 day-frame, would be [mean variance skewness kurtosis] from the period day201-day300 (100 days in total)... and so on.
I know this means I would get an array output, and the the first frame number of days would be NaNs, but I can't figure out the required indexing to get this done...
This is an interesting question because I think the optimal solution is different for the mean than it is for the other sample statistics.
I've provided a simulation example below that you can work through.
First, choose some arbitrary parameters and simulate some data:
%#Set some arbitrary parameters
T = 100; N = 5;
WindowLength = 10;
%#Simulate some data
X = randn(T, N);
For the mean, use filter to obtain a moving average:
MeanMA = filter(ones(1, WindowLength) / WindowLength, 1, X);
MeanMA(1:WindowLength-1, :) = nan;
I had originally thought to solve this problem using conv as follows:
MeanMA = nan(T, N);
for n = 1:N
MeanMA(WindowLength:T, n) = conv(X(:, n), ones(WindowLength, 1), 'valid');
end
MeanMA = (1/WindowLength) * MeanMA;
But as #PhilGoddard pointed out in the comments, the filter approach avoids the need for the loop.
Also note that I've chosen to make the dates in the output matrix correspond to the dates in X so in later work you can use the same subscripts for both. Thus, the first WindowLength-1 observations in MeanMA will be nan.
For the variance, I can't see how to use either filter or conv or even a running sum to make things more efficient, so instead I perform the calculation manually at each iteration:
VarianceMA = nan(T, N);
for t = WindowLength:T
VarianceMA(t, :) = var(X(t-WindowLength+1:t, :));
end
We could speed things up slightly by exploiting the fact that we have already calculated the mean moving average. Simply replace the within loop line in the above with:
VarianceMA(t, :) = (1/(WindowLength-1)) * sum((bsxfun(#minus, X(t-WindowLength+1:t, :), MeanMA(t, :))).^2);
However, I doubt this will make much difference.
If anyone else can see a clever way to use filter or conv to get the moving window variance I'd be very interested to see it.
I leave the case of skewness and kurtosis to the OP, since they are essentially just the same as the variance example, but with the appropriate function.
A final point: if you were converting the above into a general function, you could pass in an anonymous function as one of the arguments, then you would have a moving average routine that works for arbitrary choice of transformations.
Final, final point: For a sequence of window lengths, simply loop over the entire code block for each window length.
I have managed to produce a solution, which only uses basic functions within MATLAB and can also be expanded to include other functions, (for finance: e.g. a moving Sharpe Ratio, or a moving Sortino Ratio). The code below shows this and contains hopefully sufficient commentary.
I am using a time series of Hedge Fund data, with ca. 10 years worth of daily returns (which were checked to be stationary - not shown in the code). Unfortunately I haven't got the corresponding dates in the example so the x-axis in the plots would be 'no. of days'.
% start by importing the data you need - here it is a selection out of an
% excel spreadsheet
returnsHF = xlsread('HFRXIndices_Final.xlsx','EquityHedgeMarketNeutral','D1:D2742');
% two years to be used for the moving average. (250 business days in one year)
window = 500;
% create zero-matrices to fill with the MA values at each point in time.
mean_avg = zeros(length(returnsHF)-window,1);
st_dev = zeros(length(returnsHF)-window,1);
skew = zeros(length(returnsHF)-window,1);
kurt = zeros(length(returnsHF)-window,1);
% Now work through the time-series with each of the functions (one can add
% any other functions required), assinging the values to the zero-matrices
for count = window:length(returnsHF)
% This is the most tricky part of the script, the indexing in this section
% The TwoYearReturn is what is shifted along one period at a time with the
% for-loop.
TwoYearReturn = returnsHF(count-window+1:count);
mean_avg(count-window+1) = mean(TwoYearReturn);
st_dev(count-window+1) = std(TwoYearReturn);
skew(count-window+1) = skewness(TwoYearReturn);
kurt(count-window +1) = kurtosis(TwoYearReturn);
end
% Plot the MAs
subplot(4,1,1), plot(mean_avg)
title('2yr mean')
subplot(4,1,2), plot(st_dev)
title('2yr stdv')
subplot(4,1,3), plot(skew)
title('2yr skewness')
subplot(4,1,4), plot(kurt)
title('2yr kurtosis')