Question About Stacked BiLSTM (Multi-layer BiLSTM) - neural-network

I noticed that a single-layer BiLSTM consists of two independent (unidirectional) LSTMs, one for the forward direction and one for the reverse. The outputs of the two LSTMs are then concatenated along the hidden dimension (usually dim = -1) to form the output of the BiLSTM layer.
And in a multi-layer model, each inner layer accepts the output of the previous layer and passes its own output on to the next layer.
So far, there is no ambiguity.
But for a multi-layer BiLSTM I found some ambiguity. Since each BiLSTM layer contains two independent LSTMs, I don't know what input the inner layers accept:
The concatenated output of the previous layer? (If that's true, it means the input_size of each inner LSTM, whether left-to-right or right-to-left, is 2 * (hidden_size of the previous layer).) (See this implementation, and this picture from: Illustrating the use of two BiLSTMs for Semantic Role Labelling. Source: He et al. 2017, fig. 1.)
Or should we treat a multi-layer BiLSTM as two unidirectional multi-layer LSTMs (one left-to-right, the other right-to-left), where each unidirectional LSTM accepts only the outputs of its own direction from the previous layers? Then, after both multi-layer unidirectional LSTM computations finish, we concatenate the left-to-right and right-to-left outputs of each layer to form the output of that BiLSTM layer. (See this picture from: Arrhythmia Classification in Multi-Channel ECG Signals Using Deep Neural Networks, Kim 2018, fig. 3.2.)
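For what it's worth, one quick way to check which convention a particular implementation uses is to inspect its weight shapes. A minimal sketch, assuming PyTorch's nn.LSTM (the sizes 10 and 20 are arbitrary):

import torch.nn as nn

# a 2-layer bidirectional LSTM with input_size=10, hidden_size=20
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, bidirectional=True)

# layer 0 sees the raw input, so its input-to-hidden weights are (4*20, 10)
print(lstm.weight_ih_l0.shape)          # torch.Size([80, 10])

# layer 1 (in both directions) sees the concatenated forward+backward output
# of layer 0, so its input size is 2 * hidden_size = 40 -- i.e. the first
# interpretation above
print(lstm.weight_ih_l1.shape)          # torch.Size([80, 40])
print(lstm.weight_ih_l1_reverse.shape)  # torch.Size([80, 40])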

Related

Keras explanation: number of nodes in input layer

I'm trying to understand the relationship between a simple perceptron and the neural network one gets when using the Keras Sequential class.
I learned that a perceptron looks like this:
Each "node" in the first layer is one of the features of a sample: x_1, x_2, ..., x_n.
Could somebody explain the jump to the neural network I find in the Keras package below?
Since the input layer has four nodes, does that mean the network consists of four of these perceptrons?
There seems to be a misunderstanding of what a perceptron is. A perceptron is a single unit that multiplies the inputs by weights, sums them up, and applies an activation function:
Now, the diagrams you show are called multi-layer perceptrons (MLPs) and consist of a stack of perceptrons organised in layers (see the Wikipedia article). In Keras there is no explicit notion of a perceptron, but there is a layer of perceptrons, implemented as a Dense layer, so called because the layers are densely connected: every output is connected to every input between layers. The second diagram would correspond to:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(4, activation='sigmoid', input_dim=3))
model.add(Dense(4, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
assuming you use sigmoid activations. In this case the input layer is implicit, specified via input_dim=3, and the final Dense layer is the output layer.
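As a quick sanity check (assuming Keras is installed), calling model.summary() on the network above should list three Dense layers with 16, 20, and 5 trainable parameters respectively (weights plus biases: 3·4+4, 4·4+4, 4·1+1), matching the 3-4-4-1 diagram.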

units of neural network layer are independent?

In a neural network there are three main parts: the input layer, the hidden layer(s), and the output layer. Is there any correlation between the units of a hidden layer? For example, are the 1st and 2nd neurons of a hidden layer independent of each other, or is there a relation between them? Is there any source that explains this issue?
The answer depends on many factors. From a probabilistic perspective they are independent given the inputs and before training. If the input is not fixed, then they are heavily correlated (as two "almost linear" functions of the same input signal). Finally, after training they will be strongly correlated, and the exact correlations will depend on the initialisation and the training itself.
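To make the middle point concrete, here is a minimal numpy sketch (the layer sizes and the tanh activation are arbitrary choices for illustration): two hidden units of a randomly initialised layer, measured over random inputs, are often clearly correlated simply because both are smooth functions of the same input.

import numpy as np

rng = np.random.default_rng(0)

# one randomly initialised hidden layer: 3 inputs -> 2 hidden units
W = rng.normal(size=(3, 2))
b = np.zeros(2)

X = rng.normal(size=(1000, 3))   # 1000 random input samples
H = np.tanh(X @ W + b)           # activations of the two hidden units

# correlation of the two units over the input distribution:
# often clearly non-zero even before any training has happened
print(np.corrcoef(H[:, 0], H[:, 1])[0, 1])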

Neural network categorization: Do they always have to have one label per training data

In all the examples of categorization with neural networks that I have seen, the training data has one predominant category as the label for each input.
Can you feed in training data that has more than one label? E.g. a picture with both a "cat" and a "mouse".
I understand (maybe wrongly) that if you use softmax for the probability/prediction at the output layer, it tends to select one class (maximising discriminative power). I'm guessing this would hurt or prevent learning and predicting multiple labels per input.
Is there any approach/architecture of NN where there are multiple labels in the training data and multiple output predictions are made? Or is that already the case and I missed some vital understanding? Please clarify.
Most examples have one class per input, so no, you haven't missed anything. It is however possible to do multi-label classification (sometimes called joint classification in the literature).
The naive implementation you suggested with a softmax will struggle, as the outputs of the final layer have to add up to 1; the more classes you have, the harder it is to figure out what the network is trying to say.
You can change the architecture to achieve what you want, however. For each class you could have a binary softmax classifier that branches off from the penultimate layer, or you can use sigmoid outputs, which don't have to add up to one (even though each neuron still outputs a value between 0 and 1). Note that using sigmoids might make training more difficult.
Alternatively, you could train a separate network for each class and then combine them into one classification system at the end. It depends on how complex your envisioned task is; a sketch of the sigmoid variant follows below.
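A minimal Keras sketch of the sigmoid variant described above (the layer sizes, the 100 input features, and the 5 labels are made-up numbers for illustration):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))
# one independent probability per label; outputs need not sum to 1
model.add(Dense(5, activation='sigmoid'))
# binary cross-entropy treats each output as its own yes/no decision
model.compile(optimizer='rmsprop', loss='binary_crossentropy')

Each target is then a multi-hot vector, e.g. [1, 0, 0, 1, 0] for a picture containing both a cat and a mouse, and at prediction time you threshold each output (commonly at 0.5) instead of taking an arg-max.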
Is there any approach/architecture of NN where there are multiple labels in training data and multiple output predictions are made?
The answer is yes. To briefly answer your question, here is an example in the context of Keras, a high-level neural network library.
Let's consider the following model. We want to predict how many retweets and likes a news headline will receive on Twitter. The main input to the model will be the headline itself, as a sequence of words, but to spice things up, our model will also have an auxiliary input receiving extra data, such as the time of day when the headline was posted.
from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model

# headline input: meant to receive sequences of 100 integers, between 1 and 10000
# (note that we can name any layer by passing it a "name" argument)
main_input = Input(shape=(100,), dtype='int32', name='main_input')

# this embedding layer will encode the input sequence
# into a sequence of dense 512-dimensional vectors
x = Embedding(output_dim=512, input_dim=10000, input_length=100)(main_input)

# an LSTM will transform the vector sequence into a single vector,
# containing information about the entire sequence
lstm_out = LSTM(32)(x)

# auxiliary output, supervised directly on the LSTM summary vector
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)

# auxiliary input (e.g. time of day), concatenated with the LSTM output
auxiliary_input = Input(shape=(5,), name='aux_input')
x = concatenate([lstm_out, auxiliary_input])

# we stack a deep fully-connected network on top
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)

# and finally we add the main logistic-regression layer
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
This defines a model with two inputs and two outputs:
model = Model(inputs=[main_input, auxiliary_input], outputs=[main_output, auxiliary_output])
Now, let's compile and train the model as follows:
model.compile(optimizer='rmsprop',
              loss={'main_output': 'binary_crossentropy', 'aux_output': 'binary_crossentropy'},
              loss_weights={'main_output': 1., 'aux_output': 0.2})

# and train it via:
model.fit({'main_input': headline_data, 'aux_input': additional_data},
          {'main_output': labels, 'aux_output': labels},
          epochs=50, batch_size=32)
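For this to run, headline_data would be an integer array of shape (num_samples, 100), additional_data a float array of shape (num_samples, 5), and labels a binary array of shape (num_samples, 1); both outputs are trained against the same labels here, with the auxiliary loss down-weighted to 0.2.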
Reference: Multi-input and multi-output models in Keras

Can a convolutional neural network be built with perceptrons?

I was reading this interesting article on convolutional neural networks. It showed an image explaining that for every receptive field of 5x5 pixels/neurons, a value for one hidden neuron is calculated.
We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information.
So max-pooling is applied.
With multiple convolutional layers, it looks something like this:
But my question is: this whole architecture could be built with perceptrons, right?
For every convolutional layer, one perceptron is needed, with layers:
input_size = 5x5;
hidden_size = 10 (for example);
output_size = 1;
Then for every receptive field in the original image, the 5x5 area is fed into this perceptron to produce the value of one neuron in the hidden layer. So basically we do this for every receptive field:
So the same perceptron is used 24×24 times to construct the hidden layer, because:
is that we're going to use the same weights and bias for each of the 24×24 hidden neurons.
And this works from the hidden layer to the pooling layer as well, with input_size = 2x2 and output_size = 1. In the case of a max-pool layer, it's just a max() function over the array.
and then finally:
The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons.
which is a perceptron again.
So my final architecture looks like this:
-> 1 perceptron for every convolutional layer/feature map
-> run this perceptron over every receptive field to create the feature map
-> 1 perceptron for every pooling layer
-> run this perceptron over every field in the feature map to create the pooling layer
-> finally feed the values of the pooling layer into a regular all-to-all perceptron
Or am I overlooking something? Or is this already how they are programmed?
The answer very much depends on what exactly you call a Perceptron. Common options are:
The complete architecture. Then no, simply because it is by definition a different NN.
A model of a single neuron, specifically y = 1 if (w.x + b) > 0 else 0, where x is the input of the neuron, w and b are its trainable parameters, and w.x denotes the dot product. Then yes, you can force a bunch of these perceptrons to share weights and call it a CNN (see the sketch after this list). You'll find variants of this idea used in binary neural networks.
A training algorithm, typically associated with the Perceptron architecture. This reading makes no sense for the question, because the learning algorithm is in principle orthogonal to the architecture. That said, you cannot really use the Perceptron algorithm for anything with hidden layers, which suggests no as the answer in this case.
The loss function associated with the original Perceptron. This notion of a Perceptron is orthogonal to the problem at hand; your loss function with a CNN is given by whatever you are trying to do with the whole model. You could eventually use it, but it is non-differentiable, so good luck :-)
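To make the second reading concrete, here is a minimal numpy sketch (assuming the 28×28 input and 5×5 receptive fields from the question; the weights are random, not trained):

import numpy as np

rng = np.random.default_rng(0)

# a single "perceptron": thresholded weighted sum
def perceptron(x, w, b):
    return 1.0 if np.dot(w, x) + b > 0 else 0.0

image = rng.random((28, 28))   # toy input image
w = rng.normal(size=25)        # ONE set of weights shared by all positions
b = 0.0                        # one shared bias

# sliding the shared perceptron over every 5x5 receptive field
# yields the 24x24 feature map from the question
feature_map = np.zeros((24, 24))
for i in range(24):
    for j in range(24):
        patch = image[i:i+5, j:j+5].reshape(-1)
        feature_map[i, j] = perceptron(patch, w, b)

This is exactly a convolution followed by a hard threshold; a real CNN replaces the step function with a differentiable activation so the shared weights can be trained by backpropagation.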
A sidenote rant: you will see people refer to feed-forward, fully-connected NNs with hidden layers as "Multilayer Perceptrons" (MLPs). This is a misnomer; there are no Perceptrons in MLPs (see e.g. this discussion on Wikipedia), unless you go explore some really weird ideas. It would make more sense to call these networks multilayer linear logistic regression, because that's what they used to be composed of, up until a few years ago.

Interpret the output of neural network in matlab

I have built a neural network model with 3 classes. I understand that the ideal output for classification is a boolean 1 for the correct class and boolean zeros for the other classes; for example, the best result for data belonging to the first class would be the vector [1, 0, 0]. But the output on the test data will not look like that; instead it will contain rational numbers like [2.4, -1, 0.6]. So how do I interpret this result? How do I decide which class the test data belongs to?
I have tried taking the absolute value and turning the maximum element into 1 and the others into zeros; is this correct?
Learner.
It appears your neural network is badly designed.
Regardless of your structure (number of input, hidden and output layers), when you are doing a multi-class classification problem you must ensure that each of your output neurons evaluates an individual class, that is, that each of them has a bounded output, in this case between 0 and 1. Use almost any of the usual bounded activation functions on the output layer for this.
Nevertheless, for the neural network to work properly, you must keep in mind that every single neuron path, from input to output, operates as a classifier: it defines a region of your input space that is going to be classified.
Under this framework, every single neuron has a directly interpretable meaning in the non-linear expansion the NN is defining, particularly when there are few hidden layers. This is ensured by the general expression of neural networks:
Y_out = F_n(Y_{n-1} * w_n - t_n)
...
Y_1 = F_0(Y_in * w_0 - t_0)
For example, with radial basis neurons, i.e. F_n = sqrt(sum_i (Y_{n,i} - R_{n,i})^2) and w_n = 1 (identity):
Y_{n+1} = sqrt(sum_i (Y_{n,i} - R_{n,i})^2)
a classification into d_n-dimensional spherical clusters (d_n being the dimension of layer n-1) is induced from the first layer; similarly, elliptical clusters can be induced. When two radial-basis layers are stacked under that structure of spherical/elliptical clusters, unions and intersections of spherical/elliptical clusters are induced; with three layers, unions and intersections of the previous ones, and so on.
When using linear neurons, i.e. F_n = (.) (identity), linear classifiers are induced; that is, the input space is divided by d_n-dimensional hyperplanes. Adding a second layer induces unions and intersections of hyperplanes; with three layers, unions and intersections of the previous ones, and so on.
Hence, you can see that the number of neurons per layer is the number of classifiers available per class. So if the geometry of the space is, to put it really graphically, two clusters for class A, one cluster for class B and three clusters for class C, you will need at least six neurons per layer. Thus, assuming you could expect anything, you can take as a very rough estimate about n to n^2 neurons per class per layer as a minimum. This number can be increased or decreased according to the topology of the classification.
Finally, the best advice here, for n outputs (classes) and r inputs, is:
Have r good classifier neurons in the first layers, radial or linear, to segment the space according to your expectations,
Have n to n^2 neurons per layer, or more depending on the difficulty of your problem,
Have 2-3 layers, and only increase this number after getting clear results,
Have n thresholding neurons in the last layer (only one such layer), each a continuous function from 0 to 1, and apply the crisp decision in code; see the sketch below.
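As an illustration of that last interpretation step, using the hypothetical raw output [2.4, -1, 0.6] from the question, a minimal Python sketch that squashes the scores into (0, 1) with a softmax and picks the arg-max class:

import numpy as np

scores = np.array([2.4, -1.0, 0.6])   # raw network outputs from the question

# softmax maps arbitrary real scores to a probability distribution in (0, 1)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

predicted_class = int(np.argmax(probs))   # -> 0, i.e. the first class

Note that taking the absolute value first, as the question suggests, is not correct: it would turn a strongly negative (i.e. strongly "not this class") score into a strongly positive one.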
Cheers...