I am working on a project in which I have about 18 classes with about 4,000 total instances. I have 7 attributes: 1 is string data and the rest are nominal. I am currently using StringToWordVector on the string attribute with Platt's SMO classifier, achieving good results. We are about to implement this, but I would like to try other classifiers in case there may be one that gives better results. Any suggestions?
Also, should I be using MultiClassClassifier with so many classes? If so, what settings should I try within that?
Any advice is appreciated!
An AdaBoosted J48 decision tree yielded the best results; this has been well established in our division.
I am exploring the possibility of integrating Featuretools into my pipeline so that I can create new features from my DataFrame.
Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, like STD(column) etc., I feel it is susceptible to data leakage. Their FAQ gives an example approach to tackle this, but it is not suitable for the Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem to be compatible with Pipelines. It would use each fold's train data to construct the features and then transform that fold's test data, K times. At the end, during the refit=True stage of GridSearchCV, it would use the whole data to construct them. If you have an example showing that this is possible after all, you are very welcome to share it.
Idea 1: I can switch to a manual CV structure that is not embedded in a Pipeline. Inside it, I can use the train data to construct the new features and transform the test data with them. This runs K times. At the end, all the data can be used to construct the final model.
It is the safest option, with time and complexity disadvantages.
Idea 2: Use it on the whole data and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at the project's GitHub page, all the examples combine the train and test data, create these features from the whole data, and only then split into train and test for modeling.
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb
Then again, if the developers of the project work that way, I could give it a chance with the whole data.
What do you think? I would love to hear about your experiences with Featuretools.
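For what it's worth, here is a minimal sketch of how Idea 0 could look with plain scikit-learn, assuming the aggregation features can be expressed as a custom transformer. It does not use the Featuretools API at all: `GroupStatFeatures`, the `group`/`value` column names, and the toy data are all made up for illustration. The point is only that statistics learned in `fit()` come from the training fold, so GridSearchCV recomputes them per fold and once more on the whole data at refit.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class GroupStatFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: per-group mean/std of one column.

    The statistics are computed in fit() only, so inside cross-validation
    they never see the rows of the held-out fold.
    """

    def __init__(self, group_col="group", value_col="value"):
        self.group_col = group_col
        self.value_col = value_col

    def fit(self, X, y=None):
        grouped = X.groupby(self.group_col)[self.value_col]
        # Groups with a single row have an undefined std; use 0.0 there.
        self.stats_ = grouped.agg(["mean", "std"]).fillna(0.0)
        # Fallback values for groups that only appear at transform time.
        self.global_mean_ = X[self.value_col].mean()
        self.global_std_ = X[self.value_col].std()
        return self

    def transform(self, X):
        out = X[[self.value_col]].copy()
        joined = self.stats_.reindex(X[self.group_col])
        out["group_mean"] = joined["mean"].fillna(self.global_mean_).to_numpy()
        out["group_std"] = joined["std"].fillna(self.global_std_).to_numpy()
        return out


# Toy data, purely illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.integers(0, 5, size=200),
    "value": rng.normal(size=200),
})
y = (df["value"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = Pipeline([
    ("feats", GroupStatFeatures()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each CV split calls feats.fit() on the training part only;
# refit=True then fits the whole pipeline once on all of df.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, refit=True)
search.fit(df, y)
```

Whether the features Featuretools generates can be wrapped this way depends on your entity set, so treat this as the shape of the solution rather than a drop-in replacement.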
I have several different .weight files that were output during training. The reason I did this is that I noticed the model trained better with fewer classes than when I combined all 35 together. Would it be possible to loop through the code and call model.load_weights() multiple times?
Any help is appreciated!
Thanks
I don't see the code, but I can say that you can try creating multiple instances of the model class, each with its own weights and config, and then run each of them in whatever way you want.
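Roughly, the pattern could look like the sketch below. Since the original code isn't shown, everything here is an assumption: `build_model` is a hypothetical factory you would write to return a fresh, un-trained instance of your architecture, and `StubModel` is only a stand-in that has the same `load_weights`/`predict` method names so the example runs without any deep learning library.

```python
def load_models(build_model, weight_paths):
    """Create one model instance per weights file and load that file into it.

    build_model: hypothetical callable, name -> fresh model instance
    weight_paths: dict mapping a model name to its weights file path
    """
    models = {}
    for name, path in weight_paths.items():
        model = build_model(name)   # a separate instance for every file
        model.load_weights(path)    # each instance keeps its own weights
        models[name] = model
    return models


def predict_all(models, x):
    """Run the same input through every loaded model."""
    return {name: model.predict(x) for name, model in models.items()}


class StubModel:
    """Stand-in for a real model class; replace with your own."""

    def __init__(self, name):
        self.name = name
        self.loaded_from = None

    def load_weights(self, path):
        # A real model would read the file here; the stub just records it.
        self.loaded_from = path

    def predict(self, x):
        return f"{self.name} predicted on {x} using {self.loaded_from}"


# Hypothetical file names, one per group of classes.
weight_paths = {"group_a": "group_a.weights", "group_b": "group_b.weights"}
models = load_models(StubModel, weight_paths)
results = predict_all(models, "sample")
```

Calling `load_weights()` repeatedly on a single instance would just overwrite the previous weights, which is why each file gets its own instance here.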
I am using Google AutoML Tables for a multiclass classification problem where the number of classes is 110. For some reason, the learned model only predicts probabilities for 40 classes, and I don't understand why. These are the most frequent classes in the training set. Any help?
Thank you!
Yassine
I think that you reported this issue here as well.
Quoting the response given there:
It seems that the service is only returning the top results instead of the results for all the available classes. The AutoML Tables Engineering team is aware of this behavior and currently working on a fix.
So, for those interested, the suggestion is to subscribe to the IssueTracker to receive a notification whenever there's an update on the issue.
I am trying to create the simplest possible neural network and train it with some data.
To do that, I created a test.csv with the following pattern:
number,number+1;
number2,number2+1
...
I am trying to do a linear regression with the network...
But I cannot find a way to load the data; DataSetIterator does not work.
How do I fit the network on the data, and how do I test it?
In our examples, we encourage people to use datavec + recordreaderdatasetiterator.
Datavec has all of the various data loading components.
I'm not sure what you mean by "datasetiterator not working" without seeing any code, but it seems like you didn't really look at our examples.
There are multiple examples of a CSV record reader you can use for both regression and classification use cases.
Consider reorienting your data pipeline to use those.
Those examples are always found here:
https://github.com/deeplearning4j/dl4j-examples
If you follow any of those, the same pattern emerges:
Record reader for whatever data format -> RecordReaderDataSetIterator
The iterator's constructors let you specify common options, such as whether the task is a regression and which column holds your label.
I am trying to create a graph in OrientDB where the weight of the edges has to be calculated on demand using data from another database. I would like to know if there is a way to do this, since all the examples I've seen use static weight properties, none of which are dynamic by nature.
If I could use a stored function as a property and have it evaluated each time I call shortestPath, it would solve my problem, but I haven't found any documentation on this topic.
Help would be greatly appreciated!
This isn't supported by OrientDB out of the box, even though it would be nice to have. Could you open a new issue?
As a workaround, I suggest cloning the OSQLFunctionDijkstra class, making your changes, and plugging it into the OrientDB engine under a different name.
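To illustrate the kind of change involved, here is a language-neutral sketch in Python (not OrientDB code, and not the actual OSQLFunctionDijkstra implementation) of a Dijkstra search that asks a callback for each edge weight at traversal time instead of reading a stored property. The names `neighbors`, `weight_fn`, and the `live_costs` dict standing in for the external database are all hypothetical.

```python
import heapq
import math


def shortest_path(neighbors, weight_fn, source, target):
    """Dijkstra where every edge weight is computed on demand.

    neighbors: callable, node -> iterable of adjacent nodes
    weight_fn: callable, (node, next_node) -> non-negative weight,
               evaluated each time the edge is relaxed
    Returns (total_cost, path) or None if target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    done = set()
    counter = 0                      # tie-breaker so nodes are never compared
    heap = [(0.0, counter, source)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nxt in neighbors(node):
            w = weight_fn(node, nxt)  # dynamic lookup, e.g. another database
            if w < 0:
                raise ValueError("Dijkstra requires non-negative weights")
            nd = d + w
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = node
                counter += 1
                heapq.heappush(heap, (nd, counter, nxt))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Toy graph; live_costs stands in for the external data source.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
live_costs = {("A", "B"): 1.0, ("A", "C"): 4.0, ("B", "D"): 5.0, ("C", "D"): 1.0}


def lookup_weight(u, v):
    return live_costs[(u, v)]


first = shortest_path(graph.__getitem__, lookup_weight, "A", "D")
live_costs[("A", "C")] = 10.0        # the external data changes
second = shortest_path(graph.__getitem__, lookup_weight, "A", "D")
```

Because the weight is fetched on every relaxation, a slow external lookup can dominate the running time, so caching weights for the duration of one query may be worth considering in the cloned function.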