Doc2Vec Clustering with KMeans for a new document

I have a corpus trained with Doc2Vec as follows:
d2vmodel = Doc2Vec(vector_size=100, min_count=5, epochs=10)
d2vmodel.build_vocab(train_corpus)
d2vmodel.train(train_corpus, total_examples=d2vmodel.corpus_count, epochs=d2vmodel.epochs)
Using the vectors, the documents are clustered with kmeans:
kmeans_model = KMeans(n_clusters=NUM_CLUSTERS, init='k-means++', random_state = 42)
X = kmeans_model.fit(d2vmodel.docvecs.vectors_docs)
labels=kmeans_model.labels_.tolist()
I would like to use the k-means to cluster a new document and know which cluster it belongs to. I've tried the following but I don't think the input for predict is correct.
from numpy import array
testdocument = gensim.utils.simple_preprocess('Microsoft excel')
cluster_label = kmeans_model.predict(array(testdocument))
Any help is appreciated!

Your kmeans_model expects a feature vector similar to what it was given during its original clustering – not the list of string tokens you'll get back from gensim.utils.simple_preprocess().
In fact, you want to use the Doc2Vec model to take such lists-of-tokens and turn them into model-compatible vectors, via its infer_vector() method. For example:
testdoc_words = gensim.utils.simple_preprocess('Microsoft excel')
testdoc_vector = d2vmodel.infer_vector(testdoc_words)
cluster_label = kmeans_model.predict(array([testdoc_vector]))
Note that both Doc2Vec training and inference work better on documents that are at least tens of words long (not tiny two-word phrases like your test here), and that inference often also benefits from a larger-than-default optional epochs parameter (especially on short documents).
Note also that your test document should be preprocessed and tokenized exactly the same way as your training data – so if some other process was used to prepare train_corpus, use that same process for post-training documents. (Words the Doc2Vec model doesn't recognize, because they weren't present during training, are silently ignored – so an error like applying a different style of case-flattening at inference time will weaken results a lot.)
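Putting those notes together, a minimal sketch (the longer test text and the epochs value of 100 are only illustrative choices; in older gensim releases the inference argument is named steps rather than epochs):
testdoc_words = gensim.utils.simple_preprocess('Microsoft excel spreadsheets with formulas, pivot tables and charts')
testdoc_vector = d2vmodel.infer_vector(testdoc_words, epochs=100)
cluster_label = kmeans_model.predict(array([testdoc_vector]))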


Numba signature for structured arrays

Numba's documentation does not give any example of signatures for functions that take structured arrays. I have tried several ways, but all were rejected by Numba (and Pylance).
import numba as nb
import numpy as np
PairSpec = [("x", np.float32), ("y", np.float32)]
Pair = np.dtype(PairSpec)
NumbaPair = nb.from_dtype(Pair)
# BUG None of this works
# @nb.jit(np.float32(Pair[:]))
# @nb.jit(np.float32(NumbaPair[:]))
@nb.jit
def sum(pairs):
    pair = pairs[0]
    return pair.x + pair.y

pairs = np.array([(2, 3)], dtype=PairSpec)
print(sum(pairs))
How to give a signature to a function that takes structured arrays?
The correct signature is nb.float32(NumbaPair[:]). Note the use of nb.float32 and not np.float32. Also note that arrays of structures (AoS) generally tend to be less efficient than structures of arrays (SoA). This is especially true for coordinates, since most fields are generally read and an AoS layout prevents any efficient vectorization (while modern x86-64 processors can typically compute ~16 float32 values per cycle per core with SIMD, as opposed to 2 for scalar code).
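For illustration, a minimal sketch of the working version (the function is renamed sum_pair here only to avoid shadowing the built-in sum, and nopython=True is an optional strictness flag):
import numba as nb
import numpy as np

PairSpec = [("x", np.float32), ("y", np.float32)]
Pair = np.dtype(PairSpec)
NumbaPair = nb.from_dtype(Pair)

@nb.jit(nb.float32(NumbaPair[:]), nopython=True)
def sum_pair(pairs):
    # Access the first record and add its two fields
    pair = pairs[0]
    return pair.x + pair.y

pairs = np.array([(2, 3)], dtype=Pair)
print(sum_pair(pairs))  # 5.0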

Pause JModelica and Pass Incremental Inputs During Simulation

Hi Modelica Community,
I would like to run two models in parallel in JModelica but I'm not sure how to pass variables between the models. One model is a Python model and the other is an EnergyPlusToFMU model.
The examples in the JModelica documentation have the inputs for the full simulation period defined prior to simulating the model. I don't understand how one would configure a model that pauses for inputs, which is a key feature of FMUs and co-simulation.
Can someone provide me with an example or piece of code that shows how this could be implemented in JModelica?
Do I put the simulate command in a loop? If so, how do I handle warm up periods and initialization without losing data at prior timesteps?
Thank you for your time,
Justin
Late answer, but in case it is picked up by others...
You can indeed put the simulation into a loop; you just need to keep track of the state of your system so that you can re-initialize it at every iteration. Consider the following example:
Ts = 100
x_k = x_0
for k in range(100):
    # Do whatever you need to get your input here
    u_k = ...
    FMU.reset()
    FMU.set(x_k.keys(), x_k.values())
    sim_res = FMU.simulate(
        start_time=k*Ts,
        final_time=(k+1)*Ts,
        input=u_k
    )
    x_k = get_state(FMU, sim_res, -1)
Now, I have written a small function to grab the state, x_k, of the system:
# Get state names and their values at given index
def get_state(fmu, results, index):
    # Identify states as variables with a _start_ value
    identifier = "_start_"
    keys = fmu.get_model_variables(filter=identifier + "*").keys()
    # Now, loop through all states, get their value and put it in x
    x = {}
    for name in keys:
        x[name] = results[name[len(identifier):]][index]
    # Return state
    return x
This relies on setting the "state_initial_equations": True compile option.
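For completeness, a rough sketch of passing that option at compile time (MyPackage.MyModel and MyPackage.mo are placeholder names, and this assumes the pymodelica/pyfmi packages shipped with JModelica):
from pymodelica import compile_fmu
from pyfmi import load_fmu

# Compile with state initial equations exposed, so the state can be set each iteration
fmu_path = compile_fmu("MyPackage.MyModel", "MyPackage.mo",
                       compiler_options={"state_initial_equations": True})
FMU = load_fmu(fmu_path)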

Training Spark's word2vec with an RDD[String]

I'm new to Spark and Scala so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to their documentation, one way to do this is
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
The text8 dataset contains one line of many words, meaning that input will become an RDD[Seq[String]].
After massaging my own dataset, which has one word per line, using various maps etc., I'm left with an RDD[String], but I can't seem to train the word2vec model on it. I tried input.map(v => Seq(v)), which does indeed give an RDD[Seq[String]], but that produces one sequence per word, which I guess is totally wrong.
How can I wrap a sequence around my strings, or is there something else I have missed?
EDIT
So I kind of figured it out. Starting from clean being an RDD[String], I do val input = sc.parallelize(Seq(clean.collect().toSeq)). This gives me the correct data structure (RDD[Seq[String]]) to fit the word2vec model. However, running collect on a large dataset gives me an out-of-memory error. I'm not quite sure how they intend the fitting to be done? Maybe it is not really parallelizable. Or maybe I'm supposed to have several semi-long sequences of strings inside an RDD, instead of one long sequence like I have now?
It seems that the documentation has been updated in another location (even though I was looking at the "latest" docs). The new docs are at: https://spark.apache.org/docs/latest/ml-features.html
The new example drops the text8 example file altogether. I doubt whether the original example ever worked as intended. The RDD input to word2vec should be a set of lists of strings, typically sentences or otherwise constructed n-grams.
Example included for other lost souls:
val documentDF = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)
Why not
input.map(v => v.split(" "))
or whatever would be an appropriate delimiter to split your words on? This will give you the desired sequence of strings - but with valid words.
As far as I can recall, Word2Vec in ml takes a DataFrame as its argument, while Word2Vec in mllib can take an RDD. The example you posted is for the Word2Vec in ml. Here is the official guide: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
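For anyone working from Python, a rough PySpark sketch of the RDD-based mllib variant (the Scala mllib API is analogous; the names and tiny corpus here are only illustrative):
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="word2vec-example")
# Each element is one document/sentence, already split into tokens
corpus = sc.parallelize([
    "hi i heard about spark".split(" "),
    "i wish java could use case classes".split(" "),
])
model = Word2Vec().setVectorSize(10).setMinCount(1).fit(corpus)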

How to obtain the trained best model from a CrossValidator

I built a pipeline including a DecisionTreeClassifier (dt), like this:
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
Then I used this pipeline as the estimator in a CrossValidator in order to get a model with the best set of hyperparameters like this
val c_v = new CrossValidator().setEstimator(pipeline).setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")).setEstimatorParamMaps(paramGrid).setNumFolds(5)
Finally, I could train a model on a training set with this cross-validator:
val model = c_v.fit(train)
But the question is, I want to view the best trained decision tree model with the .toDebugString method of DecisionTreeClassificationModel. But model is a CrossValidatorModel. Yes, you can use model.bestModel, but it is still of type Model, so you cannot call .toDebugString on it. And I also assume the bestModel is still a pipeline including labelIndexer, featureIndexer, dt, labelConverter.
So does anyone know how I can obtain the decision tree model from the model fitted by the cross-validator, so that I can view the actual model via toDebugString? Or is there any workaround that lets me view the decision tree model?
Well, in cases like this one the answer is always the same - be specific about the types.
First extract the pipeline model, since what you are trying to train is a Pipeline:
import org.apache.spark.ml.PipelineModel

val bestModel: Option[PipelineModel] = model.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}
Then you'll need to extract the model from the underlying stage. In your case it's a decision tree classification model:
import org.apache.spark.ml.classification.DecisionTreeClassificationModel

val treeModel: Option[DecisionTreeClassificationModel] = bestModel flatMap {
  _.stages.collect {
    case t: DecisionTreeClassificationModel => t
  }.headOption
}
To print the tree, for example:
treeModel.foreach(t => println(t.toDebugString))
(DISCLAIMER: There is another aspect here which, IMHO, deserves its own answer. I know it is a little off-topic given the question; however, it questions the question. If somebody downvotes because they disagree with the content, please also leave a comment.)
Should you extract the "best" tree? The answer is typically no.
Why are we doing CV? We are trying to evaluate our choices. The choices are the classifiers used, the hyperparameters used, and preprocessing such as feature selection. For the last one it is important that this happens on the training data only; e.g., do not normalise the features on all the data. So the output of CV is the pipeline that was generated. On a side note: the feature selection should be evaluated in an "internal" CV.
What we are not doing is generating a "pool of classifiers" from which we choose the best classifier. However, I've seen this surprisingly often. The problem is that you have an extremely high chance of a twinning effect. Even in a perfectly i.i.d. dataset there are likely (near-)duplicated training examples, so there is a pretty good chance that the "best" CV classifier is just an indication of which fold has the most twinning.
Hence, what should you do? Once you have fixed your parameters, you should use the entire training data to build the final model. Hopefully (though hardly anybody does this) you have also set aside an additional evaluation set, never touched during the process, to get an evaluation of your final model.

What does the . operator do in MATLAB?

I came across some MATLAB code that did the following:
thing.x=linspace(...
I know that usually the . operator makes the next operation elementwise, but what does it do by itself? Is this just a sub-object operator, like in C++?
Yes, it's a sub-object operator.
You can have things like
Roger.lastname = "Poodle";
Roger.SSID = 111234997;
Roger.children.boys = {"Jim", "John"};
Roger.children.girls = {"Lucy"};
And the things to the right of the dots are called fields.
You can also define classes in MATLAB and instantiate objects of those classes, and then if thing were one of those objects, thing.x would be an instance variable of that object.
The MATLAB documentation is excellent; look up "fields" and "classes" in it.
There are other uses for the dot. M*N means multiply two things: if M and N are both matrices, this applies the rules of matrix multiplication to get a new matrix as its result. But M.*N means: if M and N are the same shape, multiply them element by element. And so on like that, with more subtleties, but that is out of scope of what you asked here.
As @marc points out, the dot is also used to reference fields and subfields of something MATLAB calls a struct or structure. These are a lot like classes, subclasses and enums, it seems to me. The idea is you can have a struct, say data, and store all the info that goes with data like this:
olddata = data; % we assume we have an old struct like the one we are creating, we keep a reference to it
data.date_created=date();
data.x_axis = [1 5 2 9];
data.notes = "This is just a trivial example for stackoverflow. I didn't check to see if it runs in matlab or not, my bad.";
data.versions.current = "this one";
data.versions.previous = olddata;
The point is that ANY MATLAB object/datatype/whatever you want to call it can be referenced by a field in the struct. The last entry shows that we can even reference another struct in the field of a struct. The implication of this last bit is that we could look at the date of creation of the previous version:
data.versions.previous.date_created
To me this looks just like objects in Java, EXCEPT I haven't put any methods in there. MATLAB does support Java objects, which to me look a lot like these structs, except some of the fields can reference functions.
Technically, it's a form of indexing, as per mwengler's answer. However, it can also be used for method invocation on objects in recent versions of MATLAB, i.e.
obj.methodCall;
However, note that there is some inefficiency in that style: basically, the system first has to work out whether you meant indexing into a field and, if not, then call the method. It's more efficient to do
methodCall(obj);