Deploy Knowledge Studio dictionary pre-annotator to Natural Language Understanding - annotations

I'm getting started with Knowledge Studio and Natural Language Understanding.
I'm able to deploy a machine-learning model toNatural Language Understanding and use the API to query it.
I would know if there's a way to deploy only the pre-annotator.
I read from Knowledge Studio's documentation that
You can deploy or export a machine-learning annotator. A dictionary pre-annotator can only be used to pre-annotate documents within Watson Knowledge Studio.
Does exist a workaround to create a model that simply does the job of the pre-annotator, i.e. use dictionaries to find entities instead of the machine-learning model?

Does exist a workaround to create a model that simply does the job of the pre-annotator, i.e. use dictionaries to find entities instead of the machine-learning model?
You may need to explain this better in what you need.
WKS allows you to pre-annotate documents with dictionaries you upload. Once you have created a ML model, you can alternatively use that to annotate your training documents, and then manually correct. As you continue the amount of manual work will reduce after each model iteration.
The assumption is that you are creating a model with a reasonable amount of examples. In your model results, you will want the mention/relations to be outside or close to outside the gray area of the report.
The other interpretation of your request I took was you want to create a dictionary based model only. This is possible using the "Rule-Based Model" functionality. You would have to create the parsing rules but you just map what you want to find to the dictionary/rule.
Using this in production though is still limited. You should get a warning when you deploy these kinds of models.
It's slightly better than just a keyword search as you can map items to parts of speech.
The last point. The purpose of WKS is to create a machine learning model which will do the work in discovering new terms you haven't seen before. With the rule based engine it can only find what you explicitly tell it to find.
If all you want is just dictionary entries, then you can create a very simple string comparison solution, but you lose the linguistic features.

Related

How to implement Featuretools into my ML Process?

I am exploring the possibility of implementing Featuretools into my pipeline, to be able to create new features from my Df.
Currently I am using a GridSearchCV, with a Pipeline embedded inside it. Since Featuretools is creating new features with aggregation on columns, like STD(column) etc, I feel like it is suspectible to data leakage. In their FAQ, they are giving an example approach to tackle it, which is not suitable for a Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline but it seems like not compatible with Pipelines. It would use fold train data to construct features, transform fold test data. K times. At the end, it would use whole data to construct, during Refit= True stage of GridSearchCV. If you have any example opposed to this fact, you are very welcome.
Idea 1: I can switch to a manual CV structure, not embedded into pipeline. And inside it, I can use Train data to construct new features, and test data to transform with these. It will work K times. At the end, all data can be used to construct Ultimate model.
It is the safest option, with time and complexity disadvantages.
Idea 2: Using it with whole data, ignore the leakage possibility. I am not in favor of this of course. But when I look at Project Github page, all the examples are combining Train and Test data, creating these features with whole data. Then go on with Train-Test division for modeling.
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb
Actually if the developers of the project think like that, I could give it a chance with whole data.
What do you think, I would love to hear about your experiences on FeatureTools.

Embedding version control of the weaviate

In Weaviate, the vector engine, I wonder how this can handle version issue of embedding model.
For instance, considering the (trained) word2vec model, embedded vectors from different models must be seperated.
One option might think is that make distinct multiple classes representing model version.
Custom script may useful. If new model available, create new class and import accorded data. After that, change (GET) entrypoints (used for searching nearest vectors) to the new class.
Or maybe weaviate have other fancy way to handle this issue, but I couldn't find.
As at version 1.17.3, you have to manage this yourself because weaviate only supports one embedding per object.
There is a feature request to allow multiple embeddings per object here. But it sounds like your request is closer to this one. In any case, have a look at them and upvote the one that addresses your need so the engineering team can prioritize accordingly. Also, feel free to raise a new feature request if neither of these addresses your needs.

Are Operational Transformation Frameworks only meant for text?

Looking at all the examples of Operational Transformation Frameworks out there, they all seem to resolve around the transformation of changes to plain text documents. How would an OT framework be used for more complex objects?
I'm wanting to dev a real-time sticky notes style app, where people can co-create sticky notes, change their positon and text value. Would I be right in assuming that the position values wouldn't be transformed? (I mean, how would they, you can't merge them right?). However, I would want to use an OT framework to resolve conflicts with the posit-its value, correct?
I do not see any problem to use Operational Transformation to work with Complex Objects, what you need is to define what operations your OT system support and how concurrency is solved for them
For instance, if you receive two Sticky notes "coordinates move operation" from two different users from same 'client state', you need to make both states to converge, probably cancelling out second operation.
This is exactly the same behaviour with text when two users generate two updates to delete a text range that overlaps completely, (or maybe partially), the second update processed must be transformed against the previous and the resultant operation will only effectively delete a portion of the original one, (or completely cancelled with a 'no-op')
You can take a look on this nice explanation about how Google Wave Operational Transformation works and guess from this point how it should work your own implementation
See the following paper for an approach to using OT with trees if you want to go down that route:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.74
However, in your particular case, I would use a separate plain text OT document for each stickynote and use an existing library, eg: etherPad, to do the heavy lifting. The positions of the notes could then be broadcast on a last-committer-wins basis.
Operation Transformation is a general technique, it works for any data type. The point is you need to define your transformation functions. Also, there are some atomic attributes that you cannot merge automatically like (position and background color) those will be mostly "last-update wins" or the user solves them manually when there is a conflict.
there are some nice libs and frameworks that provide OT for complex data already out there:
ShareJS : library for Node which provides all operations on JSON objects
DerbyJS: framework for NodeJS, it uses ShareJS for OT stuff.
Open Coweb framework : Dojo foundation project for cooperative web applications using OT

In salesforce.com can you have multivalued attributes?

I am developing a Novell Identity Manager driver for Salesforce.com, and am trying to understand the Salesforce.com platform better.
I have had really good success to date. I can read pretty much arbitrary object classes out of SFDC, and create eDirectory objects for them, and what not. This is all done and working nicely. (Publisher Channel). Once I got Query events mapped out, most everything started working in the Publisher Channel.
I am now working on sending events back to SFDC (Subscriber channel) when changes occur in eDirectory.
I am using the upsert() function in the SOAP API, and with Novell Identity Manager, you basically build the SOAP doc, and can see the results as you build it. (You can do it in XSLT or you can use the various allowed tokens to build the document in DirXML Script. I am using DirXML Script which has been working well so far.).
The upshot of that comment is that I can build the SOAP document, see it, to be sure I get it right. Which is usually different than the Java/C++ approach that the sample code usually provides. Much more visual this way.
There are several things about upsert() that I do not entirely understand. I know how to blank a value, should I get that sort of event. Inside the <urn:sObjects> node, add a node like (assuming you get your namespaces declared already):
<urn1:fieldsToNull>FieldName</urn1:fieldsToNull>
I know how to add a value (AttrValue) to the attribute (FieldName), add a node like:
<FieldName>AttrValue</FieldName>
All this works and is pretty straight forward.
The question I have is, can a value in SFDC be multi-valued? In eDirectory, a multi valued attribute being changed, can happen two ways:
All values can be removed, and the new set re-added.
The single value removed can be sent as that sort of event (remove-value) or many values can be removed in one operation.
Looking at SFDC, I only ever see Multi-picklist attributes that seem to be stored in a single entry : or ; delimited. Is there another kind of multi valued attribute managed differently in SFDC? And if so, how would one manipulate it via the SOAP API?
I still have to decide if I want to map those multi-picklists to a single string, or a multi valued attribute of strings. First way is easier, second way is more useful... Hmmm... Choices...
Some references:
I have been using the page Sample SOAP messages to understand what the docs should look like.
Apex Explorer is a kicking tool for browsing the database and testing queries. Much like DBVisualizer does for JDBC connected databases. This would have been so much harder without it!
SoapUi is also required, and a lovely tool!
As far as I know there's no multi-value field other than multi-select picklists (and they map to semicolon-separated string). Generally platform encourages you to create a proper relationship with another (possibly new, custom) table if you're in need of having multiple values associated to your data.
Only other "unusual" thing I can think of is how the OwnerId field on certain objects (Case, Lead, maybe something else) can be used to point to User or Queue record. Looks weird when you are used to foreign key relationships from traditional databases. But this is not identical with what you're asking as there will be only one value at a time.
Of course you might be surpised sometimes with values you'll see in the database depending on the viewing user's locale (stuff like System Administrator profile becoming Systeembeheerder in Dutch). But this will be still a single value, translated on the fly just before the query results are sent back to you.
When I had to perform SOAP integration with SFDC, I've always used WSDL files and most of the time was fine with Java code generated out of them with Apache Axis. Hand-crafting the SOAP message yourself seems... wow, hardcore a bit. Are you sure you prefer visualisation of XML over the creation of classes, exceptions and all this stuff ready for use with one of several out-of-the-box integration methods? If they'll ever change the WSDL I need just to regenerate the classes from it; whereas changes to your SOAP message creation library might be painful...

machine learning and code generator from strings

The problem: Given a set of hand categorized strings (or a set of ordered vectors of strings) generate a categorize function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download, install and go kind of things, as opposed to to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at; the input I'm wanting to filter is computer generated status strings pulled from logs. Error messages (as an example) being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
Document Classification
Unless the error messages are being editorialized by people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document document classification organized by programming language:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and as adi92 mentioned Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
Mallet has a bunch of classifiers which you can train and deploy entirely from the commandline
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs a 'spam' and 'not spam', you could do other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.