How to obtain the best trained model from a CrossValidator - Scala

I built a pipeline including a DecisionTreeClassifier (dt), like this:
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
Then I used this pipeline as the estimator in a CrossValidator in order to get a model with the best set of hyperparameters, like this:
val c_v = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator()
    .setLabelCol("indexedLabel")
    .setPredictionCol("prediction"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
Finally, I could train a model on a training set with this CrossValidator:
val model = c_v.fit(train)
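(The paramGrid passed to the CrossValidator is not shown in the question; for a DecisionTreeClassifier it would typically be built with a ParamGridBuilder along these lines, where maxDepth and maxBins are just example parameters:)
import org.apache.spark.ml.tuning.ParamGridBuilder
val paramGrid = new ParamGridBuilder()
  .addGrid(dt.maxDepth, Array(3, 5, 7))
  .addGrid(dt.maxBins, Array(16, 32))
  .build()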
But here is the problem: I want to view the best trained decision tree model via the toDebugString method of DecisionTreeClassificationModel, but model is a CrossValidatorModel. Yes, you can use model.bestModel, but it is still of type Model, so you cannot call toDebugString on it. I also assume the bestModel is still a pipeline including labelIndexer, featureIndexer, dt and labelConverter.
So does anyone know how I can obtain the decision tree model from the model fitted by the CrossValidator, so that I can view the actual model via toDebugString? Or is there any workaround that lets me view the decision tree model?

Well, in cases like this one the answer is always the same - be specific about the types.
First extract the pipeline model, since what you are trying to train is a Pipeline:
import org.apache.spark.ml.PipelineModel
val bestModel: Option[PipelineModel] = model.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}
Then you'll need to extract the model from the underlying stage. In your case it's a decision tree classification model:
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
val treeModel: Option[DecisionTreeClassificationModel] = bestModel.flatMap {
  _.stages.collect {
    case t: DecisionTreeClassificationModel => t
  }.headOption
}
To print the tree, for example (toDebugString only returns a String, so pass it to println):
treeModel.foreach(t => println(t.toDebugString))
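If you also want to see which hyperparameters the winning tree ended up with, the extracted model carries them. A small sketch (depth and numNodes are generic properties of the fitted tree, not tied to any particular grid):
treeModel.foreach { t =>
  println(s"depth = ${t.depth}, numNodes = ${t.numNodes}")
  println(t.extractParamMap())
}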

(DISCLAIMER: There is another aspect which, imho, deserves its own answer. I know it is a little off-topic given the question; however, it questions the question itself. If somebody downvotes because they disagree with the content, please also leave a comment.)
Should you extract the "best" tree? The answer is typically no.
Why are we doing CV? We are trying to evaluate our choices: the classifier used, the hyperparameters used, and preprocessing such as feature selection. For the last one it is important that this happens on the training data only, e.g. do not normalise the features on all of the data. So the output of CV is a judgment about the pipeline configuration, not one particular fitted model. On a side note: the feature selection should be evaluated in an internal (nested) CV.
What we are not doing is generating a "pool of classifiers" from which we choose the best classifier. However, I've seen this surprisingly often. The problem is that you have an extremely high chance of a twinning effect: even in a perfectly i.i.d. dataset there are likely (near-)duplicated training examples, so there is a pretty good chance that the "best" CV classifier merely indicates in which fold you had the most favourable twinning.
Hence, what should you do? Once you have fixed your parameters, you should use the entire training data to build the final model. Ideally (though hardly anybody does this) you have also set aside an additional evaluation set, never touched during the process, to get an evaluation of your final model.
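To make that concrete in the Spark setting of the question, a minimal sketch (model, pipeline and train are the names from the question; it assumes the evaluator's metric is larger-is-better, which holds for the default f1 of MulticlassClassificationEvaluator):
// Pick the parameter map with the best average cross-validation metric...
val bestParams = model.getEstimatorParamMaps.zip(model.avgMetrics).maxBy(_._2)._1
// ...and refit the whole pipeline on the full training data with exactly those parameters
val finalModel = pipeline.fit(train, bestParams)
(Note that Spark's CrossValidatorModel.bestModel is already refit on the full training data with the winning parameters, so this mainly matters if you build the final model outside the CrossValidator.)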

Related

Doc2Vec Clustering with kmeans for a new document

I have a corpus trained with Doc2Vec as follows:
d2vmodel = Doc2Vec(vector_size=100, min_count=5, epochs=10)
d2vmodel.build_vocab(train_corpus)
d2vmodel.train(train_corpus, total_examples=d2vmodel.corpus_count, epochs=d2vmodel.epochs)
Using the vectors, the documents are clustered with kmeans:
kmeans_model = KMeans(n_clusters=NUM_CLUSTERS, init='k-means++', random_state = 42)
X = kmeans_model.fit(d2vmodel.docvecs.vectors_docs)
labels=kmeans_model.labels_.tolist()
I would like to use the k-means to cluster a new document and know which cluster it belongs to. I've tried the following but I don't think the input for predict is correct.
from numpy import array
testdocument = gensim.utils.simple_preprocess('Microsoft excel')
cluster_label = kmeans_model.predict(array(testdocument))
Any help is appreciated!
Your kmeans_model expects a feature vector similar to those it was provided during its original clustering – not the list of string tokens you'll get back from gensim.utils.simple_preprocess().
In fact, you want to use the Doc2Vec model to take such lists-of-tokens and turn them into model-compatible vectors, via its infer_vector() method. For example:
testdoc_words = gensim.utils.simple_preprocess('Microsoft excel')
testdoc_vector = d2vmodel.infer_vector(testdoc_words)
cluster_label = kmeans_model.predict(array([testdoc_vector]))[0]  # predict expects a 2D array (one row per document)
Note that both Doc2Vec and inference work better on documents of at least tens-of-words long (not tiny 2-word phrases like your test here), and that inference may also often benefit from using a larger-than-default optional epochs parameter (especially on short documents).
Note also that your test document should be really preprocessed and tokenized exactly the same as your training data – so if some other process was used for preparing train_corpus, use that same process for post-training documents. (Words not recognized by the Doc2Vec model, because they weren't present during training, will be silently ignored – so an error like doing a different style of case-flattening at inference time will weaken results a lot.)

How to define the callback in Deep Learning?

Would you please guide me what the philosophy of Callbacks is? In other words, how can I define them for each unique problem?
I assume you are talking about Keras. Callbacks are used to execute code at certain points during training. Everything is explained quite well in the Keras documentation.
Here is an example of how to log the loss during training:
from keras.callbacks import Callback
class LossHistory(Callback):
    def on_train_begin(self, logs={}):
        self.losses = []
    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))
        print(logs.get('loss'))
model.fit(data, data, epochs=50, batch_size=72, validation_data=(data, data), verbose=0, shuffle=False, callbacks=[LossHistory()])

Pause JModelica and Pass Incremental Inputs During Simulation

Hi Modelica Community,
I would like to run two models in parallel in JModelica, but I'm not sure how to pass variables between the models. One model is a Python model and the other is an EnergyPlusToFMU model.
The examples in the JModelica documentation have the inputs for the full simulation period defined prior to simulating the model. I don't understand how one would configure a model that pauses for inputs, which is a key feature of FMUs and co-simulation.
Can someone provide me with an example or piece of code that shows how this could be implemented in JModelica?
Do I put the simulate command in a loop? If so, how do I handle warm up periods and initialization without losing data at prior timesteps?
Thank you for your time,
Justin
Late answer, but in case it is picked up by others...
You can indeed put the simulation into a loop, you just need to keep track of the state of your system, such that you can re-init it at every iteration. Consider the following example:
Ts = 100
x_k = x_0
for k in range(100):
    # Do whatever you need to get your input here
    u_k = ...
    # Reset and re-initialise the FMU from the current state values
    FMU.reset()
    FMU.set(list(x_k.keys()), list(x_k.values()))
    sim_res = FMU.simulate(
        start_time=k*Ts,
        final_time=(k+1)*Ts,
        input=u_k
    )
    # Keep the state at the last time index for the next iteration
    x_k = get_state(FMU, sim_res, -1)
Now, I have written a small function to grab the state, x_k, of the system:
# Get state names and their values at given index
def get_state(fmu, results, index):
    # Identify states as variables with a _start_ value
    identifier = "_start_"
    keys = fmu.get_model_variables(filter=identifier + "*").keys()
    # Now, loop through all states, get their value and put it in x
    x = {}
    for name in keys:
        x[name] = results[name[len(identifier):]][index]
    # Return state
    return x
This relies on setting the "state_initial_equations": True compile option.

Training Spark's Word2Vec with an RDD[String]

I'm new to Spark and Scala, so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to the documentation, one way to do this is:
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
The text8 dataset contains one line of many words, meaning that input will become an RDD[Seq[String]].
After massaging my own dataset, which has one word per line, using different maps etc., I'm left with an RDD[String], but I can't seem to train the word2vec model on it. I tried doing input.map(v => Seq(v)), which does actually give an RDD[Seq[String]], but that yields one sequence per word, which I guess is totally wrong.
How can I wrap a sequence around my strings, or is there something else I have missed?
EDIT
So I kind of figured it out. Starting from clean, which is an RDD[String], I do val input = sc.parallelize(Seq(clean.collect().toSeq)). This gives me the correct data structure (RDD[Seq[String]]) to fit the word2vec model. However, running collect on a large dataset gives me an out-of-memory error. I'm not quite sure how they intend the fitting to be done. Maybe it is not really parallelizable? Or maybe I'm supposed to have several semi-long sequences of strings inside an RDD, instead of one long sequence like I have now?
It seems that the documentation has been updated in another location (even though I was looking at the "latest" docs). The new docs are at: https://spark.apache.org/docs/latest/ml-features.html
The new example drops the text8 example file altogether. I doubt whether the original example ever worked as intended. The RDD input to word2vec should be a set of lists of strings, typically sentences or otherwise constructed n-grams.
Example included for other lost souls:
val documentDF = sqlContext.createDataFrame(Seq(
"Hi I heard about Spark".split(" "),
"I wish Java could use case classes".split(" "),
"Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")
// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
.setInputCol("text")
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0)
val model = word2Vec.fit(documentDF)
Why not
input.map(v => v.split(" "))
or whatever would be an appropriate delimiter to split your words on. This will give you the desired sequence of strings - but with valid words.
As far as I can recall, Word2Vec in ml takes a DataFrame as its argument, while Word2Vec in mllib can take an RDD. The example you posted is for Word2Vec in ml. Here is the official guide: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
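For completeness, a minimal mllib sketch, under the assumption that each line of the input file is one sentence (the file name and the lookup word are made up):
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD
// One sentence per line -> RDD[Seq[String]], which is the shape mllib's Word2Vec.fit expects
val sentences: RDD[Seq[String]] = sc.textFile("sentences.txt").map(_.split(" ").toSeq)
val w2vModel = new Word2Vec().setVectorSize(100).setMinCount(5).fit(sentences)
val synonyms = w2vModel.findSynonyms("spark", 5)  // assumes "spark" appears in the vocabulary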

Functional implementation of Tarjan's Strongly Connected Components algorithm

I went ahead and implemented the textbook version of Tarjan's SCC algorithm in Scala. However, I dislike the code - it is very imperative/procedural with lots of mutating state and book-keeping indices. Is there a more "functional" version of the algorithm? I believe imperative versions of algorithms hide the core ideas behind the algorithm, unlike functional versions. I found someone else encountering the same problem with this particular algorithm, but I have not been able to translate his Clojure code into idiomatic Scala.
Note: If anyone wants to experiment, I have a good setup that generates random graphs and tests your SCC algorithm against running Floyd-Warshall.
See Lazy Depth-First Search and Linear Graph Algorithms in Haskell by David King and John Launchbury. It describes many graph algorithms in a functional style, including SCC.
The following functional Scala code generates a map that assigns a representative to each node of a graph. Each representative identifies one strongly connected component. The code is based on Tarjan's algorithm for strongly connected components.
In order to understand the algorithm it might suffice to understand the fold and the contract of the dfs function.
def scc[T](graph: Map[T, Set[T]]): Map[T, T] = {
  //`dfs` finds all strongly connected components below `node`
  //`path` holds the depth for all nodes above the current one
  //`sccs` holds the representatives found so far; the accumulator
  def dfs(node: T, path: Map[T, Int], sccs: Map[T, T]): Map[T, T] = {
    //returns the earliest encountered node of the two arguments
    //if neither is on the path, `old` is returned
    def shallowerNode(old: T, candidate: T): T =
      (path.get(old), path.get(candidate)) match {
        case (_, None) => old
        case (None, _) => candidate
        case (Some(dOld), Some(dCand)) => if (dCand < dOld) candidate else old
      }
    //handle the child nodes
    val children: Set[T] = graph(node)
    //the initially known shallowest back-link is `node` itself
    val (newState, shallowestBackNode) = children.foldLeft((sccs, node)) {
      case ((foldedSCCs, shallowest), child) =>
        if (path.contains(child))
          (foldedSCCs, shallowerNode(shallowest, child))
        else {
          val sccWithChildData = dfs(child, path + (node -> path.size), foldedSCCs)
          val shallowestForChild = sccWithChildData(child)
          (sccWithChildData, shallowerNode(shallowest, shallowestForChild))
        }
    }
    newState + (node -> shallowestBackNode)
  }
  //run the above function, so every node gets visited
  graph.keys.foldLeft(Map[T, T]()) { case (sccs, nextNode) =>
    if (sccs.contains(nextNode))
      sccs
    else
      dfs(nextNode, Map(), sccs)
  }
}
I've tested the code only on the example graph found on the Wikipedia page.
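For a quick sanity check, a tiny usage sketch on a made-up graph (not the Wikipedia one):
// 1 -> 2 -> 3 -> 1 forms one strongly connected component; 4 -> 3 leaves 4 on its own
val g = Map(1 -> Set(2), 2 -> Set(3), 3 -> Set(1), 4 -> Set(3))
val reps = scc(g)
// nodes 1, 2 and 3 share a single representative, while node 4 gets its own
assert(reps(1) == reps(2) && reps(2) == reps(3) && reps(4) != reps(1))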
Difference to imperative version
In contrast to the original implementation, my version avoids explicitly unwinding the stack and simply uses a proper (non-tail-)recursive function. The stack is represented by a persistent map called path instead. In my first version I used a List as the stack, but this was less efficient since it had to be searched for membership.
Efficiency
The code is rather efficient. For each edge, you have to update and/or access the immutable map path, which costs O(log|N|), for a total of O(|E| log|N|). This is in contrast to O(|E|) achieved by the imperative version.
Linear Time implementation
The paper in Chris Okasaki's answer gives a linear time solution in Haskell for finding strongly connected components. Their implementation is based on Kosaraju's Algorithm for finding SCCs, which basically requires two depth-first traversals. The paper's main contribution appears to be a lazy, linear time DFS implementation in Haskell.
What they require to achieve a linear time solution is a set with O(1) singleton add and membership test. This is basically the same problem that makes the solution given in this answer have a higher complexity than the imperative solution. They solve it with state threads in Haskell, which can also be done in Scala (see Scalaz). So if one is willing to make the code rather complicated, it is possible to implement Tarjan's SCC algorithm in a functional O(|E|) version.
Have a look at https://github.com/jordanlewis/data.union-find, a Clojure implementation of the algorithm. It's sorta disguised as a data structure, but the algorithm is all there. And it's purely functional, of course.