What is the correct svmlight input format in Mallet? - classification

I am using Mallet with the SVMLight input format to do classification usingNaiveBayes classifier. But I get a NumberFormatException. I'm wondering how I can use strings features when using SVMLight. As I read in the guideline 1, the features can also be strings.
Can anyone help me what is wrong with my code or input?
Here is my code:
public void trainMalletNaiveBayes() throws Exception {
ArrayList<Pipe> pipes = new ArrayList<Pipe>();
pipes.add(new SvmLight2FeatureVectorAndLabel());
pipes.add(new PrintInputAndTarget());
SerialPipes pipe = new SerialPipes(pipes);
//prepare training instances
InstanceList trainingInstanceList = new InstanceList(pipe);
trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/featureFiles_svm.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));
//prepare test instances
InstanceList testingInstanceList = new InstanceList(pipe);
testingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/test_set.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));
ClassifierTrainer trainer = new NaiveBayesTrainer();
Classifier classifier = trainer.train(trainingInstanceList);
And here is the first three lines of my input file:
No f1:NP f2:NN f3:1 f4:1 f5:0 f6:0 f7:0 f8:0.0 f9:1 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:NN f17:NOTHING
No f1:NP f2:NN f3:8 f4:4 f5:0 f6:0 f7:1 f8:4.127134385045092 f9:8 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:DT f17:NOTHING
Yes f1:NP f2:NN f3:4 f4:3 f5:0 f6:0 f7:0 f8:0.0 f9:4 f10:true f11:false f12:false f13:false f14:false f15:NP f16:DT f17:NN
The first column is the label of the instance and there rest of the data includes the features and their values. For example, NN shows the POS of the head word of a phrase.
In the meantime, I get the exception for the NN (NumberFormatException: For input string: "NN") . I'm wondering why it doesn't have any problem with the NP which comes before that, but stops at the NN.

All features need to have numeric values. For booleans you can use true=1 and false=0. You would also have to modify f1:NP to f1_NP=1.
The reason it's not dying on the NP is that the SvmLight2FeatureVectorAndLabel class is expecting to parse an entire line (label and data), but the code is reading the file with a CsvIterator that is splitting off the first element as a label.
The classify.tui.SvmLight2Vectors class uses this code for an iterator:
new SelectiveFileLineIterator (fileReader, "^\\s*#.+")

Related

How to predict the outcome variables using a saved pipeline when the data set does not contain the actual outcome?

I have a data set that contains the following columns: outcome (this is the outcome that we want to predict), and raw (a column that consists of text). I want to develop an ML model that will predict the outcome from the raw column. I have trained an ML model in Databricks using the following pipeline:
regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")
countVec = CountVectorizer(inputCol="words", outputCol="features")
indexer = StringIndexer(inputCol="outcome", outputCol="label").setHandleInvalid("skip").fit(trainDF)
inverter = IndexToString(inputCol="prediction", outputCol="prediction_label", labels=indexer.labels)
nb = NaiveBayes(labelCol="label", featuresCol="features", smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[regexTokenizer, indexer, countVec, nb, inverter])
model = pipeline.fit(trainDF)
model.write().overwrite().save("/FileStore/project")
In another notebook, I load the model and try to predict the values for a new data set. This data set does not contain the outcome variable ("outcome" in this case):
model = PipelineModel.load("/FileStore/project")
score_output_df = model.transform(score_this)
When I try to predict the values for the new data set, I get an error message that the column "outcome" cannot be found. I suspect that this is due to the fact that some stages in the pipeline transform this column (the indexer and inverter stages are used to convert the outcome column to numbers and then back to string labels.).
My question is this, how can I load a saved model and use it to predict values when the original pipeline contains stages that have this column as an input.
instead of using
model.write().overwrite().save("/FileStore/project")
you have to write it like this
model.write().overwrite().save("/FileStore/project/model.sav")
and then for loading you will use this
model = PipelineModel.load("/FileStore/project/model.sav")
score_output_df = model.transform(score_this)
I have found a solution to the problem and will post it here so that if someone faces the same problem they can benefit from it. The solution was simply to extract the stages that I want to use in the prediction and save them to the model as such:
model = PipelineModel.load("/FileStore/project")
stages1 = []
stages1 += [model.stages[0]]
stages1 += [model.stages[2]]
stages1 += [model.stages[3]]
stages1 += [model.stages[4]]
model.stages = stages1
score_output_df = model.transform(score_this)
In this code, I exclude the second step ([1]) because it contains the indexer. Once I do this, I can predict values when the "outcome" column is not available.

Inputs to Encoder-Decoder LSTMCell/RNN Network

I'm creating an LSTM Encoder-Decoder Network, using Keras, following the code provided here: https://github.com/LukeTonin/keras-seq-2-seq-signal-prediction. The only change I made is to replace the GRUCell with an LSTMCell. Basically both the encoder and decoder consists of 2 layers, of 35 LSTMCells. The layers are stacked over (and combined with) each other using an RNN Layer.
The LSTMCell returns 2 states whereas the GRUCell returns 1 state. This is where I am encountering an error, as I do not know how to code for the 2 returned states of the LSTMCell.
I have created two models: first, an encoder-decoder model. Second, a prediction model. I am not encountering any problems in the encoder-decoder model, but a encountering problems in the decoder of the prediction model.
The error I am getting is:
ValueError: Layer rnn_4 expects 9 inputs, but it received 3 input tensors. Input received: [<tf.Tensor 'input_4:0' shape=(?, ?, 1) dtype=float32>, <tf.Tensor 'input_11:0' shape=(?, 35) dtype=float32>, <tf.Tensor 'input_12:0' shape=(?, 35) dtype=float32>]
This error happens when this line below, in the prediction model, is run:
decoder_outputs_and_states = decoder(
decoder_inputs, initial_state=decoder_states_inputs)
The section of code this fits into is:
encoder_predict_model = keras.models.Model(encoder_inputs,
encoder_states)
decoder_states_inputs = []
# Read layers backwards to fit the format of initial_state
# For some reason, the states of the model are order backwards (state of the first layer at the end of the list)
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
# One state for GRU, but two states for LSTMCell
decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
decoder_outputs_and_states = decoder(
decoder_inputs, initial_state=decoder_states_inputs)
decoder_outputs = decoder_outputs_and_states[0]
decoder_states = decoder_outputs_and_states[1:]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_predict_model = keras.models.Model(
[decoder_inputs] + decoder_states_inputs,
[decoder_outputs] + decoder_states)
Could somebody help me with the for loop above, and initial states I should be passing the decoder after that?
I had an similar error and i solved just doing what he says, adding another input tensor:
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
# One state for GRU
decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
here it solved the prolem...

How to apply word2vec for k-means clustering?

I am new to word2vec. With applying this method, I am trying to form some clusters based on words extracted by word2vec from scientific publications' abstracts. To this end, I have first retrieved sentences from the abstracts via stanfordNLP and put each sentence into a line in a text file. Then the text file required by deeplearning4j word2vec was ready to process (http://deeplearning4j.org/word2vec).
Since the texts come from scientific fields, there are a lot of mathematical terms or brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I have run word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
#Override
public String preProcess(String sentence) {
//System.out.println(sentence.toLowerCase());
return sentence.toLowerCase();
}
});
// Split on white spaces in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
.minWordFrequency(5)
.iterations(1)
.layerSize(100)
.seed(42)
.windowSize(5)
.iterate(iter)
.tokenizerFactory(t)
.build();
log.info("Fitting Word2Vec model....");
vec.fit();
log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file containing many words withe their related vector values in each row as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
As a subsequent step, this text file has been used to form some clusters via k-means in spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(2,101).map(_.toDouble))).cache()
val numberOfClusters = 10
val numberOfInterations = 100
//We use KMeans object provided by MLLib to run
val modell = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfInterations)
modell.clusterCenters.foreach(println)
//Get cluster index for each buyer Id
val AltCompByCluster = rawData.map {
row=>
(modell.predict(Vectors.dense(row.split(' ').slice(2,101)
.map(_.toDouble))),row.split(',').slice(0,1).head)
}
AltCompByCluster.foreach(println)
As a result of the latest scala code above, I have retrieved 10 clusters based on the word vectors suggested by word2vec. However, when I have checked my clusters no obvious common words appeared. That is, I could not get reasonable clusters as I expected. Based on this bottleneck of mine I have a few questions:
1) From some tutorials for word2vec I have seen that no data cleaning is made. In other words, prepositions etc. are left in the text. So how should I apply cleaning procedure when applying word2vec?
2) How can I visualize the clustering results in a explanatory way?
3) Can I use word2vec word vectors as input to neural networks? If so which neural network (convolutional, recursive, recurrent) method would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.

XMeans ELKI fails at every third input file

I'm trying to cluster image data (stored in 100 separate csv files) with ELKI's XMeans algorithm. It works well for the first two files, but then the algorithm hangs on forever while processing the third file. It looks like the problem occurs at every 3rd file or so, because when I start the loop, that goes over all files at the fourth file, it works for the fourth and the fifth file, but not for the sixth file. Same goes for the 9th and 11th file... but maybe that's coincidence.
My XMeans call looks like this:
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
Database db = new StaticArrayDatabase(dbc, null);
db.initialize();
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
DBIDRange ids = (DBIDRange) rel.getDBIDs();
SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
KMeansInitialization initializer = new FirstKInitialMeans();
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data);
KMeansQualityMeasure informationCriterion = new WithinClusterMeanDistanceQualityMeasure();
RandomFactory random = new RandomFactory(123);
KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, init, true);
XMeans<NumberVector, KMeansModel> xm = new XMeans<>(dist, 5, 50, 1, innerKMeans, initializer, splitInitializer, informationCriterion, random);
Clustering<KMeansModel> c = xm.run(db, rel);
I'm not too sure about these four lines, so maybe that's why it works for some files and for others it doesn't:
KMeansInitialization initializer = new FirstKInitialMeans();
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data);
KMeansQualityMeasure informationCriterion = new WithinClusterMeanDistanceQualityMeasure();
RandomFactory random = new RandomFactory(123);
data is just a double[][] which contains the data from the input files.
Any help would be very appreciated!
Please, use the Parameterization API to configure X-means.
Because of the nested k-means, it is very easy to configure things badly.
The initializer of the inner k-means class must be set to this:
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans((double[][]) null);
KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, splitInitializer, true);
because otherwise X-means currently cannot control the initialization of the inner algorithm. I will remove this parameter, and have XMeans set the initializer of the inner algorithm.
Without a stack trace (as mentioned by #Anony-Mousse) it is hard to say what is happening. My best guess is that this meta-algorithm (an algorithm that runs another algorithm!) is not correctly configured and maybe chooses bad initialial values?

PyBrain how to interpret the results from net.activate?

I've trained a network on PyBrain for purpose of classification and am ready to fire away with specific input. However, when I do
classes = ['apple', 'orange', 'peach', 'banana']
data = ClassificationDataSet(len(input), 1, nb_classes=len(classes), class_labels=classes)
data._convertToOneOfMany( ) # recommended by PyBrain
fnn = buildNetwork( data.indim, 5, data.outdim, outclass=SoftmaxLayer )
trainer = BackpropTrainer( fnn, dataset=data, momentum=m, verbose=True, weightdecay=wd)
trainer.trainUntilConvergence(maxEpochs=80)
# stop training and start using my trained network here
output = fnn.activate(input)
As expected, I get a numeric value for "output", but is there a way to determine the predicted class label directly? Even if there's not one, how can I map the value of "output" to my class label? Thank you for your help.
When you say you get a numeric value for "output" do you mean a scalar (that is, not an array)? From my understanding of it, you should have gotten an array of four values (ie. as many as possible output classes you have). The biggest value in that array corresponds to the index of the class. I don't know if PyBrain provides an utility function to extract that, but you can do it like this:
class_index = max(xrange(len(output)), key=output.__getitem__)
class_name = classes[class_index]
Incidentally, you omitted the step in which you actually fill the data in the dataset.