Having a combination of pre-trained and supervised embeddings in the Rasa NLU pipeline - chatbot

I am new to Rasa and have started creating a very domain-specific chatbot. As part of that, I understand it's better to use supervised embeddings in the NLU pipeline, since my use case is domain-specific.
I have an example intent in my nlu.md
## create_system_and_config
- create a [VM](system) of [12 GB](config)
If I use a supervised featurizer, it should work fine for my domain-specific entities. My concern is: by using only supervised learning, won't we lose the advantage of pre-trained models? For example, in a query such as add a (some_system) of (some_config), "add" and "create" are very closely related, and a pre-trained model would pick up such verbs easily. Is it possible to combine a pre-trained model with supervised learning on top of it in the NLU pipeline, something like transfer learning?

If you're creating a domain-specific chatbot, it's generally better to use supervised embeddings instead of pre-trained ones.
For example, in general English, the word “balance” is closely related
to “symmetry”, but very different to the word “cash”. In a banking
domain, “balance” and “cash” are closely related and you’d like your
model to capture that.
In your case, too, your model needs to capture that the words "VM" and "Virtual Machine" mean the same thing. Pre-trained featurizers are more generic and are not trained to capture this.
The advantage of using pre-trained word embeddings in your pipeline is
that if you have a training example like: “I want to buy apples”, and
Rasa is asked to predict the intent for “get pears”, your model
already knows that the words “apples” and “pears” are very similar.
This is especially useful if you don't have enough training data.
For more details, you can refer to the Rasa documentation.
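To the original question of combining both: Rasa does let you mix a pre-trained (dense) featurizer with a featurizer trained on your own data (sparse) in the same pipeline, and the intent classifier learns on top of both. A minimal sketch of such a `config.yml`, assuming spaCy is installed and the component names from Rasa 2.x:

```yaml
language: en
pipeline:
  - name: SpacyNLP                  # loads pre-trained spaCy word vectors
  - name: SpacyTokenizer
  - name: SpacyFeaturizer           # dense features from pre-trained embeddings
  - name: CountVectorsFeaturizer    # sparse features learned from your own data
  - name: DIETClassifier            # trains supervised embeddings on top of both
    epochs: 100
```

This way "add" vs. "create" similarity comes from the pre-trained vectors, while domain terms like "VM" are learned from your training data.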

How to implement Featuretools into my ML Process?

I am exploring the possibility of integrating Featuretools into my pipeline, to be able to create new features from my DataFrame.
Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, like STD(column) etc., I feel it is susceptible to data leakage. In their FAQ they give an example approach to tackle it, but it is not suitable for the Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline, but it seems incompatible with Pipelines. It would use the fold's train data to construct features and transform the fold's test data, K times. At the end, it would use the whole data to construct features during the refit=True stage of GridSearchCV. If you have any example contradicting this, you are very welcome to share it.
Idea 1: I can switch to a manual CV structure, not embedded in a pipeline. Inside it, I can use the train data to construct new features, and transform the test data with them. This works K times. At the end, all the data can be used to construct the final model.
It is the safest option, with time and complexity disadvantages.
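Idea 1 can be sketched generically: fit any feature-construction step on the training fold only, then apply it to the held-out fold. A minimal leakage-safe sketch in plain NumPy, with standardization standing in for Featuretools' DFS (with Featuretools you would call `ft.dfs()` on the train fold and `ft.calculate_feature_matrix()` on the held-out fold instead):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # shuffle row indices, then split into k disjoint folds
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i]) for i in range(k)]

X = np.arange(20, dtype=float).reshape(10, 2)   # toy data, 10 rows x 2 cols

scores = []
for train_idx, test_idx in kfold_indices(len(X), k=5):
    X_tr, X_te = X[train_idx], X[test_idx]
    mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)  # "features" fit on train fold ONLY
    Z_te = (X_te - mu) / sd                       # held-out fold merely transformed
    scores.append(float(np.abs(Z_te).mean()))     # stand-in for model evaluation
```

The key point is that `mu` and `sd` (or the Featuretools feature definitions) never see the held-out rows, which is exactly what the embedded-Pipeline approach cannot guarantee with Featuretools.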
Idea 2: Use it with the whole data and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at the project's GitHub page, all the examples combine the train and test data, create these features from the whole dataset, and then proceed with the train-test split for modeling.
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb
Actually, if the developers of the project think that way, I could give it a chance with the whole data.
What do you think? I would love to hear about your experiences with Featuretools.

What is the use of the "domain" attribute when we create a custom translation model

After reading the docs on creating a custom translation model, it is not clear to me what the domain attribute is used for when creating an IBM Cloud Language Translator custom translation model.
As stated in the documentation here:
Most of the provided translation models in Language Translator can be
extended to learn custom terms and phrases or a general style that's
derived from your translation data. Follow these instructions to
create your own custom translation model.
Use a parallel corpus when you want your custom model to learn from
general translation patterns in parallel sentences in your samples.
What your model learns from a parallel corpus can improve translation
results for input text that the model has not been trained on. You can
upload multiple parallel corpora files with a request. To successfully
train with parallel corpora, the corpora files must contain a
cumulative total of at least 5000 parallel sentences. The cumulative
size of all uploaded corpus files for a custom model is limited to 250
MB.
Check the customizing your model documentation for more info.
Make sure that your Language Translator service instance is on an Advanced or Premium pricing plan.

What is currently the best way to add a custom dictionary to a neural machine translator that uses the transformer architecture?

It's common to add a custom dictionary to a machine translator to ensure that terminology from a specific domain is correctly translated. For example, the term server should be translated differently when the document is about data centers, vs when the document is about restaurants.
With a transformer model, this is not very obvious to do, since words are not aligned 1:1. I've seen a couple of papers on this topic, but I'm not sure which would be the best one to use. What are the best practices for this problem?
I am afraid you cannot easily do that. You cannot easily add new words to the vocabulary, because you don't know what embedding a word would get without training. You can try to remove some words, or alternatively you can manually change the bias in the final softmax layer to prevent some words from appearing in the translation. Anything else would be pretty difficult to do.
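As a toy illustration of the softmax-bias trick mentioned above: adding a large negative bias to a token's logit drives its probability to (near) zero, so the decoder can never emit it. The vocabulary and logits here are made up for the example:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["the", "server", "waiter", "rack"]
logits = np.array([1.0, 2.5, 0.5, 1.5])   # hypothetical decoder outputs

banned = {"waiter"}                       # e.g. suppress the restaurant sense
bias = np.array([-1e9 if w in banned else 0.0 for w in vocab])
probs = softmax(logits + bias)            # banned token gets ~zero probability

assert probs[vocab.index("waiter")] < 1e-12
```

In a real system the same bias would be added to the final projection layer's output before the softmax at every decoding step.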
What you want to do is called domain adaptation. To get an idea of how domain adaptation is usually done, you can have a look at a survey paper.
The most commonly used approaches are probably model fine-tuning or ensembling with a language model. If you have parallel data in your domain, you can try to fine-tune your model on that parallel data (with simple SGD and a small learning rate).
If you only have monolingual data in the target language, you can train a language model on that data. During decoding, you can mix the probabilities from the domain-specific language model and the translation model. Unfortunately, I don't know of any tool that does this out of the box.
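The mixing step is often called shallow fusion. A toy sketch of what happens at each decoding step, where the interpolation weight `lam` is a hyperparameter you would tune on held-out in-domain data (the distributions below are made up):

```python
import numpy as np

def fuse(p_tm, p_lm, lam=0.3):
    # log-linear interpolation of the two next-token distributions
    log_p = (1 - lam) * np.log(p_tm) + lam * np.log(p_lm)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

p_tm = np.array([0.6, 0.3, 0.1])   # translation model's next-token probabilities
p_lm = np.array([0.1, 0.2, 0.7])   # in-domain LM's next-token probabilities
p = fuse(p_tm, p_lm)
next_token = int(np.argmax(p))     # pick (or beam-search over) the fused distribution
```

With `lam=0` this reduces to plain translation-model decoding; larger `lam` pulls the output toward the domain LM's preferences.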

Convert PySpark ML Word2Vec model to Gensim Word2Vec model

I've generated a PySpark Word2Vec model like so:
from pyspark.ml.feature import Word2Vec
w2v = Word2Vec(vectorSize=100, minCount=1, inputCol='words', outputCol='vector')
model = w2v.fit(df)
(The data I used to train the model isn't relevant; what's important is that it's all in the right format and successfully yields a pyspark.ml.feature.Word2VecModel object.)
Now I need to convert this model to a Gensim Word2Vec model. How would I go about this?
If you still have the training data, re-training the gensim Word2Vec model may be the most straightforward approach.
If you only need the word-vectors, perhaps PySpark's model can export them in the word2vec.c text format, which gensim can load with .load_word2vec_format().
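A minimal sketch of that export, assuming you have already collected the vectors into a plain dict (e.g. via `{r['word']: r['vector'] for r in model.getVectors().collect()}` in PySpark); the toy 2-d vectors below just stand in for real 100-d ones:

```python
def save_word2vec_text(vectors, path):
    # write the plain-text word2vec format: header line "vocab_size dims",
    # then one "word v1 v2 ..." line per word
    dims = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(vectors)} {dims}\n")
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")

vectors = {"vm": [0.1, 0.2], "server": [0.3, 0.4]}   # toy stand-in vectors
save_word2vec_text(vectors, "spark_vecs.txt")

# gensim can then load them as read-only KeyedVectors:
#   from gensim.models import KeyedVectors
#   kv = KeyedVectors.load_word2vec_format("spark_vecs.txt", binary=False)
```

Note this gives you lookup-only vectors, not a trainable gensim Word2Vec model, which is exactly the distinction the rest of this answer is about.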
The only reason to port the model would be to continue training. Such incremental training, while possible, involves considering a lot of tradeoffs in balancing the influence of the older and later training to get good results.
If you are in fact wanting to do this conversion in order to do more training in such a manner, it again suggests that using the original training to reproduce a similar model could be plausible.
But, if you have to convert the model, the general approach would be to study the source code and internal data structures of the two models, to discover how they alternatively represent each of the key aspects of the model:
- the known word-vectors (`model.wv.vectors` in gensim)
- the known vocabulary of words, including word-frequency stats and each word's position (`model.wv.vocab` in gensim)
- the hidden-to-output weights of the model (`model.trainables` and its properties in gensim)
- other properties describing the model's modes & metaparameters
A reasonable interactive approach could be:
Write some acceptance tests that take models of both types, and test whether they are truly 'equivalent' for your purposes. (This is relatively easy for just checking if the vectors for individual words are present and identical, but nearly as hard as the conversion itself for verifying other ready-to-be-trained-more behaviors.)
Then, in an interactive notebook, load the source model, and also create a dummy gensim model with the same vocabulary size. Consulting the source code, write Python statements to iteratively copy/transform key properties over from the source into the target, repeatedly testing if they verify as equivalent.
When they do, take those steps you did manually and combine them into a utility method to do the conversion. Again verify its operation then try using the converted model however you'd hoped – perhaps discovering overlooked info or discovering other bugs in the process, and then improving the verification method and conversion method.
It's possible that the PySpark model will be missing things the gensim model expects, which might require synthesizing workable replacement values.
Good luck! (But re-train the gensim model from the original data if you want things to just be straightforward and work.)

What is the Best Test Automation Approach for WatiN

I studied both data-driven and keyword-driven approaches. After reading, it seems data-driven is better than keyword-driven. For documentation purposes keyword-driven sounds great, but it has many levels. I need guidance from people who have actually implemented automation frameworks. Personally, I want to store all data in a database or Excel and break the system up into modular parts (functions that are common to major company products).
Currently using WatiN, NUnit, CC.NET.
Any advice please.
I would highly recommend that you look into the stack that Michael Hunter, aka The Braidy Tester, built for testing Expression at Microsoft. He has a lot of articles about it: http://www.thebraidytester.com/stack.html
Essentially, he splits the stack into a logical model, a physical model and a data model, and all three are loosely coupled. All my stacks are written this way now. So the test cases end up looking like this:
Logical.Google.Search.Websearch("watin");
Verification.VerifySearchResult("watin");
All the test data is then stored in a SQL Express database indexed by the text string, in this case "watin".
You will need to build a full domain model and data access layer; I personally auto-generate that using SubSonic.
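The layering above is language-agnostic, so here is a minimal sketch of it in Python pseud?real form (all names hypothetical; in a WatiN stack these would be C# classes, and only the physical layer would drive the browser):

```python
class PhysicalGoogle:
    """Knows the concrete page controls; the only layer that would touch WatiN."""
    def search(self, text):
        # stand-in for real browser actions (navigate, type, click)
        return f"results page for {text!r}"

class LogicalGoogle:
    """Expresses intent; stays stable even if the page's controls change."""
    def __init__(self, physical, test_data):
        self.physical = physical
        self.test_data = test_data            # data model: keyed test data
    def websearch(self, key):
        query = self.test_data[key]           # look up data by string key
        return self.physical.search(query)

data = {"watin": "WatiN browser automation"}  # would live in SQL Express
logical = LogicalGoogle(PhysicalGoogle(), data)
result = logical.websearch("watin")
```

Test cases then only ever call the logical layer, so UI changes are absorbed by the physical layer and data changes by the data model.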