Using indicator functions as features in sequential models - Mallet

I am currently using Mallet to train a sequential model with a CRF. I have understood how to provide features (that depend solely on the input sequence) to the Mallet package. Based on my understanding, in Mallet we have to compute all the values of the feature functions upfront. Now, I would like to use indicator functions that depend on the label of a token. The value of these functions depends on the output label sequence, and during training I can compute them because the output label sequence is known. But when I apply this trained CRF model to a new input, whose output label sequence is unknown, how should I calculate the values of such features?
It would be very helpful if anyone could provide tips or relevant documents.

As you've phrased it, the question doesn't make sense: if you don't know the hidden labels, you can't set anything based on those unknown labels. An example might help.
You may not need to explicitly record these relationships. At training time the algorithm sets the parameters of the CRF to represent the relationship between the observed features and the unobserved state. Different CRF architectures can allow you to add dependencies between multiple hidden states.
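To make this concrete, here is a toy sketch in plain Python (the labels, features, and weights are arbitrary assumptions, not Mallet's API) of how a linear-chain CRF internally pairs every observation feature with every label. The "indicator(y_t = L) AND observation feature" conjunctions are baked into the weight table, so at test time you never compute label-dependent feature values yourself: decoding searches over candidate label sequences and scores each one.

```python
from itertools import product

# Toy linear-chain CRF scorer. Weights are hypothetical, not learned by Mallet.
LABELS = ["B", "I", "O"]

# Weight per (observation feature, label) pair: these parameters already
# encode "indicator(y_t = L) AND feature(x, t)".
obs_weight = {
    ("capitalized", "B"): 2.0,
    ("capitalized", "I"): 0.5,
    ("lowercase", "O"): 1.5,
}
# Weight per (previous label, label) transition.
trans_weight = {("B", "I"): 1.0, ("O", "B"): 0.5, ("I", "O"): 0.3}

def score(obs_feats, labels):
    """Sum observation and transition weights for one candidate labeling."""
    s = sum(obs_weight.get((f, y), 0.0) for f, y in zip(obs_feats, labels))
    s += sum(trans_weight.get((a, b), 0.0) for a, b in zip(labels, labels[1:]))
    return s

def decode(obs_feats):
    """Brute-force decoding: try every label sequence, keep the best.
    Real implementations use Viterbi, which finds the same argmax."""
    return max(product(LABELS, repeat=len(obs_feats)),
               key=lambda ys: score(obs_feats, ys))

# At test time the label sequence is unknown; decoding supplies it.
print(decode(["capitalized", "capitalized", "lowercase"]))  # → ('B', 'B', 'O')
```

So the hidden labels are quantified over, not observed: the model evaluates the label-dependent terms for every candidate sequence and picks the highest-scoring one.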

Related

Can I separately train a classifier (e.g. SVM) with two different types of features and combine the results later?

I am a student working on my first simple machine learning project. The project is about classifying articles as fake or true. I want to use an SVM as the classification algorithm and two different types of features:
TF-IDF
Lexical Features like the count of exclamation marks and numbers
I have figured out how to use the lexical features and TF-IDF as features separately. However, I have not managed to figure out how to combine them.
Is it possible to train and test two separate learning algorithms (one with TF-IDF and the other with lexical features) and combine the results later?
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
One way of combining two models is called model stacking. The idea behind it is that you take the predictions of both models and feed them into a third model (called a meta-model), which is then trained to make predictions given the output of the first two models. There is also a version of model stacking where you additionally feed the original features into the meta-model.
However, in your case another way to combine both approaches would be to simply feed both the TF-IDF and the lexical features into one model and see how that performs.
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
This would unfortunately not work: there is no combined model actually making predictions, so the averaged metrics would not describe any model you could deploy.
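To illustrate the "one model, concatenated features" route, here is a minimal standard-library sketch (the tf-idf weighting and lexical cues are simplified assumptions, not any particular library's implementation): each document becomes a single vector holding both feature groups, which one classifier, SVM included, can then consume.

```python
import math
from collections import Counter

docs = ["SHOCKING news !!! 100 reasons ...",
        "The committee published its report ."]

def tfidf_vectors(docs):
    """Very simplified tf-idf over the corpus vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * math.log((1 + n) / (1 + df[w])) for w in vocab])
    return vectors

def lexical_features(doc):
    """Hand-crafted lexical cues: exclamation marks and digit tokens."""
    return [doc.count("!"), sum(t.isdigit() for t in doc.split())]

# Concatenate the two groups into one feature vector per document.
combined = [tv + lexical_features(d)
            for tv, d in zip(tfidf_vectors(docs), docs)]
print(len(combined[0]))  # tf-idf dimensions + 2 lexical dimensions
```

The classifier never needs to know which columns came from which feature group; it just sees one longer vector, and training proceeds as usual.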

Extreme architecture of a neural network

I have a quick, simple question about neural networks. As we all know, it is often better to make a network deeper rather than wider. So what will happen if I set each hidden layer to be just one neuron and make my network really deep?
This question came up because I had a lecture about CNNs today. The reason we use CNNs is that we want to extract features from images and reduce the dimensionality of the input data. Since we keep making the input to each layer smaller and smaller, why not just use one neuron per layer and make the network deeper? Or will something bad happen?
thanks!
Obviously, the single-neuron example doesn't work -- otherwise, that's what we'd use.
The kernels of each layer in a CNN exploit spatial relationships and evaluate those juxtapositions with non-linear functions, which is the main differentiator of a CNN over a simple linear-combination NN. Without those non-linear operations, the CNN layers are merely a programming convenience.
If you immediately collapse your input to a single value, you have a huge problem in trying to write the cascading non-linearities that make up the output evaluation. Yes, it's theoretically possible to write a function with, say, 28x28x3 inputs and exactly the output you want -- and to train the multitude of parameters in that function -- but it's very messy to code and nearly impossible to maintain.
For instance, imagine trying to code an entire income tax form in a single function, such that the input was the entire range of applicable monetary amounts, Boolean status info, scalar arguments (e.g. how many children live at home), ... and have the output be the correct amount of your payment or refund. There are many Boolean conditions to apply, step functions with changing tax rates, various categories of additional tax or relief, etc.
Now, parameterize all of the constant coefficients in that massive calculation. Get some 10^6 real-life observations, and train your model on only that input and labels (tax/refund amount). You don't get any intermediate results to check, only that final amount.
It's possible, but not easy to write, debug, or maintain.
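The bottleneck problem can also be seen numerically: once a one-neuron layer maps two different inputs to the same scalar, no amount of extra depth can ever tell them apart again. A toy demonstration (the weights and inputs below are arbitrary assumptions):

```python
import math

def one_neuron_layer(x, w, b):
    """A hidden layer of width 1: the whole input collapses to one number."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = [1.0, -1.0], 0.0          # arbitrary first-layer weights
x1, x2 = [2.0, 1.0], [3.0, 2.0]  # different inputs with the same projection

h1 = one_neuron_layer(x1, w, b)
h2 = one_neuron_layer(x2, w, b)
assert h1 == h2  # both collapse to tanh(1.0)

# Every later layer sees only that single scalar, so the two inputs stay
# indistinguishable no matter how many one-neuron layers we stack on top.
deep1, deep2 = h1, h2
for _ in range(50):
    deep1 = math.tanh(0.7 * deep1 + 0.1)
    deep2 = math.tanh(0.7 * deep2 + 0.1)
print(deep1 == deep2)  # → True
```

Wider layers avoid this by keeping several independent projections alive, so distinctions lost along one direction can survive along another.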

What is an appropriate value of the parameter "Size" in nnet function in R?

I read somewhere that it should be the which.is.max of the nnet model. Is there a rule of thumb to define the value of Size?
Unfortunately, a single appropriate size hyperparameter does not exist. This value (as well as the weight decay) depends on the data and the application at hand. Cross-validation procedures can provide you with decent values for a specific dataset. You should try random search or grid search, which are two basic (yet effective) approaches to this problem. I also recommend checking this thread about how to choose the number of hidden layers and nodes in a feedforward neural network.
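As a sketch of the two search strategies mentioned above (in Python rather than R; the `cv_error` function is a hypothetical stand-in you would replace with an actual cross-validated nnet run over `size` and `decay`):

```python
import itertools
import random

def cv_error(size, decay):
    """Stand-in for a real k-fold cross-validated error of nnet(size, decay).
    This toy error surface simply has its minimum near size=7, decay=0.1."""
    return (size - 7) ** 2 * 0.01 + (decay - 0.1) ** 2 + random.random() * 0.01

random.seed(0)
sizes = [1, 3, 5, 7, 9, 12]
decays = [0.0, 0.01, 0.1, 0.5]

# Grid search: evaluate every combination, keep the best.
grid_best = min(itertools.product(sizes, decays),
                key=lambda p: cv_error(*p))

# Random search: sample combinations instead of enumerating all of them.
candidates = [(random.choice(sizes), random.choice(decays)) for _ in range(15)]
rand_best = min(candidates, key=lambda p: cv_error(*p))

print(grid_best)
```

The same loop structure applies whatever the model is; only `cv_error` changes, and random search becomes increasingly attractive as the number of hyperparameters grows.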

Is Cross Validation enough to ensure that there is no Overfitting in a classification algorithm?

I have a data set with 45 observations of one class and 55 observations of another class. Moreover, I am using 4 different features which were previously chosen by a feature selection filter, though the results of that procedure were somewhat strange.
On the other hand, I am using cross-validation and getting good accuracy results (75% to 85%) from different classifiers, since I'm using the Classification Learner app in MATLAB. Does this ensure that there is no overfitting? Or might there still be a chance of it? How can I make sure there is no overfitting?
That really depends on the training data you have available: if it isn't representative enough, you will not get a good model regardless of the methods you use for training and validation.
With that in mind, if you are sure your data is representative (it has the same distribution of values for any subset of "important" attributes as the global set of all data), then cross-validation is good enough to rely on.
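One cheap sanity check, given the 45/55 class split: compare the cross-validated accuracy against a majority-class baseline, which already scores roughly 55% on this data. A standard-library sketch (synthetic labels, trivial "classifier", purely illustrative):

```python
import random
from collections import Counter

random.seed(42)
labels = [0] * 45 + [1] * 55   # the 45/55 split from the question
random.shuffle(labels)

def kfold_accuracy(y, k=5):
    """k-fold CV of a majority-class 'classifier': it predicts the most
    common training-fold label. Any real model must clearly beat this."""
    fold = len(y) // k
    accs = []
    for i in range(k):
        test = y[i * fold:(i + 1) * fold]
        train = y[:i * fold] + y[(i + 1) * fold:]
        majority = Counter(train).most_common(1)[0][0]
        accs.append(sum(t == majority for t in test) / len(test))
    return sum(accs) / k

baseline = kfold_accuracy(labels)
print(round(baseline, 2))  # roughly 0.55: the bar your 75-85% must clear
```

Accuracy in the 75-85% range comfortably beats that bar, which is encouraging, but it still says nothing about representativeness: if the 100 observations don't cover the real population, both the model and its cross-validated estimate can be misleading.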

Torch implementation of multi-output-layer neural network

I am going to build a neural network which has more than one output layer. More specifically, it is designed to run parallel branches on top of a series of convolutional layers: one branch computes classification results (softmax-like), the other produces regression results. However, I'm stuck designing the model as well as choosing the loss functions (criterions).
I. Should I use the torch container nn.Parallel() or nn.Concat() for the branch layers on top of the conv layers (nn.Sequential())? What is the difference apart from the data format?
II. Because there are two outputs, a classification loss function and a regression loss function are to be combined linearly. I am wondering whether to choose nn.MultiCriterion() or nn.ParallelCriterion() with respect to the chosen container, or whether I have to write a custom criterion class.
III. Could anyone who has done similar work tell me whether torch needs additional customization to implement backprop for training? I am concerned about data structure issues with the torch containers.
Concat and Parallel differ in that each module in Concat gets the entire output of the last layer as input, while each module in Parallel takes a slice of that output. For your purpose you need Concat, not Parallel, since both loss functions need to see the entire output of your sequential network.
Based on the source code of MultiCriterion and ParallelCriterion, they do practically the same thing. The important difference is that with MultiCriterion you provide multiple loss functions but only one target, and they are all computed against that target. Given that you have a classification and a regression task, I assume you have different targets, so you need ParallelCriterion(false), where false enables multi-target mode (if the argument is true, ParallelCriterion seems to behave identically to MultiCriterion). The target is then expected to be a table of targets for the individual criterions.
If you use Concat and ParallelCriterion, torch should be able to compute gradients properly for you. They both implement updateGradInput, which properly merges the gradients of the individual branches.
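As a language-agnostic sketch of what ParallelCriterion(false) computes (in Python rather than Lua; all numbers and weights below are arbitrary): each head gets its own target, each loss its own weight, and the total is their linear combination.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for the classification head."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def mse(pred, target):
    """Squared error for the regression head."""
    return (pred - target) ** 2

# Two heads on one shared trunk, each with its own target...
class_logits, class_target = [2.0, 0.5, -1.0], 0
reg_pred, reg_target = 3.2, 3.0

# ...and a weighted sum of the two losses, as ParallelCriterion does.
w_cls, w_reg = 1.0, 0.5
total = (w_cls * cross_entropy(class_logits, class_target)
         + w_reg * mse(reg_pred, reg_target))
print(round(total, 4))
```

Backprop then just distributes the weights: each branch receives the gradient of its own criterion scaled by its weight, which is exactly the merging that updateGradInput performs in the containers.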