I am exploring whether I can integrate Featuretools into my pipeline to create new features from my DataFrame.
Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, such as STD(column), I feel it is susceptible to data leakage. Their FAQ gives an example approach to tackle this, but it does not suit the Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem to be compatible with Pipelines. Ideally it would construct features from each fold's training data and transform that fold's test data, K times. At the end, it would construct features from the whole dataset during the refit=True stage of GridSearchCV. If you have an example showing that this is possible, it would be very welcome.
Idea 1: I can switch to a manual CV loop that is not embedded in a Pipeline. Inside it, I can use the training data to construct the new features and then transform the test data with them. This runs K times. At the end, all the data can be used to construct the final model.
This is the safest option, at the cost of extra time and complexity.
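A minimal sketch of Idea 1, using only the Python standard library. It does not call Featuretools; `GroupStatFeatures`, `k_fold_indices` and the toy `rows` are hypothetical stand-ins for an aggregation feature such as STD(column). The point is only the split: statistics are learned on each fold's training rows and then applied to that fold's test rows.

```python
from statistics import mean, pstdev


class GroupStatFeatures:
    """Learns per-group mean/std of a numeric column from training rows only."""

    def __init__(self, group_key, value_key):
        self.group_key = group_key
        self.value_key = value_key

    def fit(self, rows):
        groups = {}
        for row in rows:
            groups.setdefault(row[self.group_key], []).append(row[self.value_key])
        self.stats_ = {g: (mean(v), pstdev(v)) for g, v in groups.items()}
        all_values = [row[self.value_key] for row in rows]
        # Fallback for groups that never appeared in the training rows.
        self.default_ = (mean(all_values), pstdev(all_values))
        return self

    def transform(self, rows):
        out = []
        for row in rows:
            m, s = self.stats_.get(row[self.group_key], self.default_)
            out.append({**row, "group_mean": m, "group_std": s})
        return out


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_validated_features(rows, k, group_key, value_key):
    """Fit on each fold's training part, then transform both parts of the fold."""
    folds = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        train = [rows[i] for i in train_idx]
        test = [rows[i] for i in test_idx]
        fe = GroupStatFeatures(group_key, value_key).fit(train)
        folds.append((fe.transform(train), fe.transform(test)))
    return folds


# Toy data, invented for illustration.
rows = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 20.0},
    {"customer": "b", "amount": 5.0},
    {"customer": "b", "amount": 15.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 25.0},
]
folds = cross_validated_features(rows, k=3, group_key="customer", value_key="amount")
# Equivalent of the refit step: learn the features from all rows for the final model.
final_features = GroupStatFeatures("customer", "amount").fit(rows).transform(rows)
```

In the first fold the test rows belong to customer "a", and their `group_mean` comes only from the one "a" row left in the training part, so nothing from the test rows leaks into their own features.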
Idea 2: Use it on the whole dataset and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at the project's GitHub page, all the examples combine the train and test data, create the features from the whole dataset, and only then split into train and test for modeling.
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb
If the developers of the project themselves work this way, I could give the whole-data approach a chance.
What do you think? I would love to hear about your experiences with Featuretools.
In Weaviate, the vector engine, I wonder how to handle versioning of the embedding model.
For instance, with a (trained) word2vec model, vectors embedded by different model versions must be kept separate.
One option I can think of is to create a distinct class for each model version.
A custom script may be useful here. When a new model becomes available, create a new class and import the corresponding data. After that, point the (GET) entrypoints used for nearest-vector search at the new class.
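A minimal sketch of the bookkeeping such a custom script would need, in plain Python. It does not call the Weaviate client; `VersionedClassRouter`, the `Article` base name and the naming scheme are all hypothetical. Creating the class and importing the data would happen between `register` and `promote`.

```python
class VersionedClassRouter:
    """Maps one logical collection to one class per embedding-model version."""

    def __init__(self, base_name):
        self.base_name = base_name
        self.versions = []   # model versions that have a class, oldest first
        self.active = None   # version currently served to search requests

    def class_name(self, model_version):
        # Hypothetical naming scheme, e.g. "Article_w2v_v2".
        return f"{self.base_name}_{model_version}"

    def register(self, model_version):
        """Record that a class for this model version exists or is being built."""
        if model_version in self.versions:
            raise ValueError(f"version {model_version!r} already registered")
        self.versions.append(model_version)
        return self.class_name(model_version)

    def promote(self, model_version):
        """Point searches at this version once its data import has finished."""
        if model_version not in self.versions:
            raise KeyError(f"unknown version {model_version!r}")
        self.active = model_version

    def search_class(self):
        """Class name that search entrypoints should query right now."""
        if self.active is None:
            raise RuntimeError("no version has been promoted yet")
        return self.class_name(self.active)


router = VersionedClassRouter("Article")
router.register("w2v_v1")
router.promote("w2v_v1")
# A new model arrives: build its class alongside the old one, then switch over.
router.register("w2v_v2")
old_target = router.search_class()
router.promote("w2v_v2")
new_target = router.search_class()
```

Keeping the old class until the new one is promoted means searches never mix vectors from two model versions, and a rollback is just another `promote` call.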
Or maybe Weaviate has some other, fancier way to handle this, but I couldn't find one.
As of version 1.17.3, you have to manage this yourself because Weaviate only supports one embedding per object.
There is a feature request to allow multiple embeddings per object here. But it sounds like your request is closer to this one. In any case, have a look at them and upvote the one that addresses your need so the engineering team can prioritize accordingly. Also, feel free to raise a new feature request if neither of these addresses your needs.
I'm getting started with Knowledge Studio and Natural Language Understanding.
I'm able to deploy a machine-learning model to Natural Language Understanding and use the API to query it.
I would like to know if there's a way to deploy only the pre-annotator.
I read from Knowledge Studio's documentation that
You can deploy or export a machine-learning annotator. A dictionary pre-annotator can only be used to pre-annotate documents within Watson Knowledge Studio.
Is there a workaround to create a model that simply does the job of the pre-annotator, i.e. uses dictionaries to find entities instead of the machine-learning model?
You may need to explain what you need in more detail.
WKS allows you to pre-annotate documents with dictionaries you upload. Once you have created an ML model, you can alternatively use that to annotate your training documents and then correct the annotations manually. As you continue, the amount of manual work will shrink with each model iteration.
The assumption is that you are creating a model with a reasonable number of examples. In your model results, you will want the mentions/relations to be outside or close to outside the gray area of the report.
The other way I read your request is that you want to create a dictionary-based model only. This is possible using the "Rule-Based Model" functionality. You would have to create the parsing rules, but you just map what you want to find to the dictionary/rule.
Using this in production though is still limited. You should get a warning when you deploy these kinds of models.
It's slightly better than just a keyword search as you can map items to parts of speech.
One last point: the purpose of WKS is to create a machine-learning model that does the work of discovering new terms you haven't seen before. The rule-based engine can only find what you explicitly tell it to find.
If all you want is just dictionary entries, then you can create a very simple string comparison solution, but you lose the linguistic features.
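A minimal sketch of such a string-comparison solution in plain Python. It is unrelated to the WKS or NLU APIs; `build_matcher`, `find_entities` and the sample dictionaries are hypothetical. It matches dictionary terms case-insensitively on word boundaries and, as noted above, knows nothing about parts of speech or other linguistic features.

```python
import re


def build_matcher(dictionaries):
    """Compile one case-insensitive pattern from {entity_type: [surface forms]}."""
    term_to_type = {}
    for entity_type, terms in dictionaries.items():
        for term in terms:
            term_to_type[term.lower()] = entity_type
    # Longest terms first so "new york city" wins over "new york".
    ordered = sorted(term_to_type, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, term_to_type


def find_entities(text, matcher):
    """Return every dictionary hit with its entity type and character offsets."""
    pattern, term_to_type = matcher
    return [
        {
            "type": term_to_type[m.group(0).lower()],
            "text": m.group(0),
            "start": m.start(),
            "end": m.end(),
        }
        for m in pattern.finditer(text)
    ]


# Sample dictionaries, invented for illustration.
matcher = build_matcher({"CITY": ["New York", "New York City"], "ORG": ["IBM"]})
entities = find_entities("IBM opened an office in New York City.", matcher)
```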
I am trying to create the simplest possible neural network and train it with some data.
For that I created a test.csv with the following pattern:
number,number+1;
number2,number2+1
...
I am trying to do a linear regression with the network...
But I cannot find a way to load the data; DataSetIterator does not work.
How do I fit the network to the data, and how do I test it?
In our examples, we encourage people to use DataVec + RecordReaderDataSetIterator.
DataVec has all of the various data loading components.
I'm not sure what you mean by "DataSetIterator does not work" without seeing any code, but it seems like you didn't really look at our examples.
They include multiple examples of a CSV record reader you can use for both regression and classification use cases.
Consider reorienting your data pipeline to use those.
Those examples are always found here:
https://github.com/deeplearning4j/dl4j-examples
If you follow any of those, the same pattern emerges:
Record reader for whatever data format -> RecordReaderDataSetIterator
The iterator's constructors let you specify common options, such as whether the task is a regression and which column holds your label.
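To make the shape of that pattern concrete, here is a rough Python analogue using only the standard library. It is not the DL4J or DataVec API: `read_records`, `batches` and `fit_line` are hypothetical names, and a closed-form least-squares fit stands in for the neural network. The data follows the number,number+1 pattern from the question.

```python
import csv
import io


def read_records(csv_text):
    """'Record reader' stage: parse each CSV line into a list of floats."""
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            yield [float(cell) for cell in row]


def batches(records, label_index, batch_size):
    """'Dataset iterator' stage: split records into features and label, then batch."""
    features, labels = [], []
    for record in records:
        labels.append(record[label_index])
        features.append(record[:label_index] + record[label_index + 1:])
        if len(labels) == batch_size:
            yield features, labels
            features, labels = [], []
    if labels:
        yield features, labels


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Stand-in for test.csv: each line is "number,number+1".
csv_text = "1,2\n2,3\n3,4\n4,5\n5,6\n"
xs, ys = [], []
for feature_batch, label_batch in batches(
    read_records(csv_text), label_index=1, batch_size=2
):
    xs.extend(f[0] for f in feature_batch)
    ys.extend(label_batch)
slope, intercept = fit_line(xs, ys)
```

The label column and batch size are passed to the iterator stage rather than baked into the reader, which mirrors the division of labor described above.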
I am looking for a tool for defining complex data manipulation processes that can be modified more or less easily by people who do not write code.
For example, my task is:
1. fetch data from sourceA
2.1 if data is full - filter it by condition 45
2.2 if data is not full - fetch additional data from source B
3. if result passes validation - return 1, otherwise 0
This should be described in some readable manner; the best option would be for the process to be modifiable in some UI tool.
What are the requirements?
Each process consists of two parts: steps, and a way to arrange them in a sequence.
(1)
The process in each step should be able to
1. emit commands for fetching some data from data-sources and inserting this into process context
2. filter, enrich, transform datasets obtained
Thus each step of this process should be described with some more or less simple DSL.
(2)
The selection of the next step, i.e. the sequence of steps, should be described with some visual tool or, as in (1), with some simple DSL.
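To make the two parts concrete, here is a minimal Python sketch in which the process is plain data: each step has an action (part 1) and a transition that picks the next step (part 2). Everything in it is hypothetical, including `run_process`, the `is_full` rule, the `condition_45` predicate and the validation rule, since the task above does not define them. A real tool would put a DSL or a visual editor in place of the `PROCESS` dictionary.

```python
def run_process(process, context):
    """Run steps by name, following each step's transition until it returns None."""
    step_name = process["start"]
    while step_name is not None:
        step = process["steps"][step_name]
        step["action"](context)
        step_name = step["next"](context)
    return context["result"]


# Hypothetical rules standing in for "data is full", "condition 45" and validation.
def is_full(ctx):
    return len(ctx["data"]) >= 2


def condition_45(row):
    return row["score"] > 45


def is_valid(ctx):
    return len(ctx["data"]) > 0


# Step actions: fetch into the process context, then filter or enrich it.
def fetch_a(ctx):
    ctx["data"] = list(ctx["sources"]["A"])


def fetch_b(ctx):
    ctx["data"] = ctx["data"] + list(ctx["sources"]["B"])


def filter_rows(ctx):
    ctx["data"] = [row for row in ctx["data"] if condition_45(row)]


def validate(ctx):
    ctx["result"] = 1 if is_valid(ctx) else 0


PROCESS = {
    "start": "fetch_a",
    "steps": {
        "fetch_a": {
            "action": fetch_a,
            "next": lambda ctx: "filter" if is_full(ctx) else "fetch_b",
        },
        "filter": {"action": filter_rows, "next": lambda ctx: "validate"},
        "fetch_b": {"action": fetch_b, "next": lambda ctx: "validate"},
        "validate": {"action": validate, "next": lambda ctx: None},
    },
}

full_run = run_process(
    PROCESS, {"sources": {"A": [{"score": 50}, {"score": 40}], "B": []}}
)
partial_run = run_process(
    PROCESS, {"sources": {"A": [{"score": 10}], "B": [{"score": 60}]}}
)
rejected_run = run_process(
    PROCESS, {"sources": {"A": [{"score": 10}, {"score": 20}], "B": []}}
)
```

Because the step order lives in `PROCESS` rather than in control flow, rearranging the process means editing data, which is the property a non-programmer-facing tool would need.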
Can you advise something for this typical, from my point of view, task?
Meanwhile, here are my own ideas.
The first thing that comes to mind is BPMN combined with Drools.
For the steps I could use DRL rules: they can only do basic data manipulation themselves, but I can call Java functions from them if I need something complicated.
For the sequence of steps I could use a standard BPMN diagram.
Maybe there is something better?
The combination of BPMN with DMN would indeed allow you to describe, with these visual standards, the execution of the process and the decision logic to be applied, in order to achieve what is described in the "For example" paragraph.
To make it fully accessible to business people, the BPMN tasks for fetching data or interacting with any external system should be prepared in advance and made available during the composition of the BPMN/DMN diagrams.
As an alternative to the BPMN+DMN combination, you can look into Fuse or Fuse Online. It cannot describe all the semantics of the BPMN+DMN combination, but with Fuse Online, for instance, you can implement the steps you described in the "For example" paragraph fully visually.
I studied both the data-driven and keyword-driven approaches. After reading about them, it seems data-driven is better than keyword-driven. For documentation purposes keyword-driven sounds great, but it has many levels. I need guidance from people who have actually implemented automation frameworks. Personally, I want to store all data in a database or Excel and break the system up into modular parts (functions that are common to the company's major products).
Currently using WatiN, NUnit, CC.net
Any advice, please?
I would highly recommend that you look into the stack that Michael Hunter, aka the Braidy Tester, built for testing Expression at Microsoft. He has a lot of articles about it: http://www.thebraidytester.com/stack.html
Essentially he splits the stack into a logical model, a physical model and a data model, and all three are loosely coupled. All my stacks are written this way now. So the test cases end up looking like this:
Logical.Google.Search.Websearch("watin");
Verification.VerifySearchResult("watin");
All the test data is then stored in a SQL Express database that is indexed by the text string, in this case watin.
You will need to build a full domain model and data access layer. I personally auto-generate that using SubSonic.
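A rough sketch of how the three layers can stay loosely coupled, written in Python rather than C# and not taken from Michael Hunter's stack. Every name is hypothetical: `DataModel` stands in for the database lookup, `FakeBrowser` for a physical layer that would really drive the UI (for example through WatiN), and the URL is a placeholder.

```python
class DataModel:
    """Data model: test inputs and expectations, looked up by a text key."""

    def __init__(self, records):
        self._records = records

    def get(self, key):
        return self._records[key]


class FakeBrowser:
    """Physical model stand-in: a real one would click and type in the actual UI."""

    def __init__(self, index):
        self._index = index
        self.results = []

    def type_and_submit(self, query):
        self.results = self._index.get(query, [])


class Logical:
    """Logical model: user intents, independent of how the UI is driven."""

    def __init__(self, browser, data):
        self._browser = browser
        self._data = data

    def web_search(self, key):
        self._browser.type_and_submit(self._data.get(key)["query"])


class Verification:
    """Checks the observed state against the expectation stored for the key."""

    def __init__(self, browser, data):
        self._browser = browser
        self._data = data

    def verify_search_result(self, key):
        return self._data.get(key)["expected_url"] in self._browser.results


# Placeholder test data and search index, invented for illustration.
data = DataModel(
    {"watin": {"query": "watin", "expected_url": "https://example.com/watin"}}
)
browser = FakeBrowser({"watin": ["https://example.com/watin"]})

# The test case itself only names the intent and the data key.
Logical(browser, data).web_search("watin")
passed = Verification(browser, data).verify_search_result("watin")
```

Swapping `FakeBrowser` for a real UI driver, or the dictionary for a database, would not change the last two lines, which is the benefit of keeping the layers separate.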