Multiclass classification with Google AutoML Tables

I am using Google AutoML Tables for a multiclass classification problem where the number of classes is 110. For some reason, the learned model only predicts probabilities for 40 classes, and I don't understand why. These 40 classes are the most frequent ones in the training set. Any help?
Thank you!
Yassine

I think that you reported this issue here as well.
Quoting the response given there:
It seems that the service is only returning the top results instead of the results for all the available classes. The AutoML Tables Engineering team is aware of this behavior and currently working on a fix.
So, for those interested, the suggestion is to subscribe to the IssueTracker to receive a notification whenever there's an update on the issue.

Related

Productize a model with Target Encoding

I am new to data science and I am experimenting with target encoding for my dataset, which has several columns with many categories (I have discovered that one-hot encoding fails me now that I am working with a real-world dataset). While building my model using the insights I gained from https://github.com/groverpr/Machine-Learning/tree/9963e59823fe0ff18cc8e1b2657b71c01f133193 and https://github.com/scikit-learn-contrib/category_encoders, I couldn't help but wonder how these models can be used in real-world situations after training. I also wonder how target encoding can even be used for feature selection. I would be very grateful for any clarification on this topic.
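To make the productization question concrete, here is a minimal sketch (with made-up column names and data) of how a fitted target encoder from category_encoders is typically kept alongside the model and reapplied to new rows at prediction time:

```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data with a high-cardinality categorical column.
train = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "lima", "tokyo", "lima"],
    "y":    [1, 0, 1, 0, 1, 0],
})

# Fit the encoder on the training data only; it stores the per-category
# target statistics internally so they can be reused later.
encoder = TargetEncoder(cols=["city"])
X_train = encoder.fit_transform(train[["city"]], train["y"])

model = RandomForestClassifier(random_state=0).fit(X_train, train["y"])

# At prediction time there is no target column: transform() looks up the
# encodings learned during fit, so the same encoder object (persisted with
# the model, e.g. via joblib) must ship to production with it.
new_rows = pd.DataFrame({"city": ["tokyo", "osaka"]})  # an unseen category falls back to the overall mean
print(model.predict(encoder.transform(new_rows)))
```

The key point is that the encoder is fitted once against the training target and then treated as a frozen preprocessing step, exactly like a scaler would be.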

Weka Classification Project Using StringToWordVector and SMO

I am working on a project in which I have about 18 classes and about 4,000 total instances. I have 7 attributes, 1 being string data and the rest nominal. I am currently using StringToWordVector on the string attribute with Platt's SMO classifier, achieving good results. We are about to implement this, but I would like to try other classifiers first in case there may be one I could get better results from. Any suggestions?
Also, should I be using MultiClassClassifier with so many classes? If so, what settings should I try within that?
Any advice is appreciated!
It is well established in our division that an AdaBoosted J48 decision tree yields the best results.
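Weka exposes this setup directly in the Explorer, but for anyone who wants to prototype the same idea outside Weka, here is a rough scikit-learn analogue (not Weka itself, and with invented data): TF-IDF plays the role of StringToWordVector, and AdaBoost over a shallow decision tree stands in for AdaBoostM1 wrapped around J48:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one free-text attribute plus a nominal attribute.
df = pd.DataFrame({
    "description": ["red sports car", "family van", "blue sports car", "cargo van"],
    "origin":      ["eu", "us", "eu", "us"],
    "label":       ["car", "van", "car", "van"],
})

features = ColumnTransformer([
    # Analogue of Weka's StringToWordVector on the string attribute.
    ("text", TfidfVectorizer(), "description"),
    # Nominal attributes are one-hot encoded.
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["origin"]),
])

# Analogue of AdaBoostM1 around J48 (a pruned C4.5-style tree).
# Note: the keyword is `estimator` in scikit-learn >= 1.2; older versions use `base_estimator`.
clf = Pipeline([
    ("features", features),
    ("boosted_tree", AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=3), n_estimators=50)),
])

clf.fit(df[["description", "origin"]], df["label"])
print(clf.predict(df[["description", "origin"]]))
```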

usage of naive bayes Model for prediction

Hi all, I am new to Scala and Spark MLlib.
I have a dataset of diseases along with their symptoms, which is in the following format:
Disease,symptom1 symptom2 symptom3
I have almost 300 entries in the above-mentioned format in a CSV file.
I want to achieve the following functionality:
If a user gives an input of symptoms, namely Symptom1, Symptom2, Symptom3, the model must be able to predict the disease.
I have the following questions:
Which machine learning model should I use to achieve this functionality?
I have gone through some models and found the Naive Bayes model; correct me if I'm wrong.
Can I provide text input to a Naive Bayes model?
Is there any sample code available to achieve this functionality?
You can use any of the classification algorithms present in Spark MLlib. For further reference, read the official docs and go through this post from the Databricks blog: https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html
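The question is framed in Scala, but as an illustration here is a minimal PySpark sketch of the kind of pipeline the blog post describes (file name and column names are hypothetical; the same stages exist in the Scala API). The space-separated symptom string is tokenized, hashed into term-frequency features, and fed to multinomial Naive Bayes:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, StringIndexer
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.appName("disease-prediction").getOrCreate()

# Hypothetical CSV: disease name, then a space-separated symptom string.
df = spark.read.csv("diseases.csv").toDF("disease", "symptoms")

# Map disease names to numeric labels; kept outside the pipeline so that
# prediction-time rows do not need a "disease" column and so we can map
# predictions back to names.
indexer = StringIndexer(inputCol="disease", outputCol="label").fit(df)
train = indexer.transform(df)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="symptoms", outputCol="tokens"),    # split symptoms on whitespace
    HashingTF(inputCol="tokens", outputCol="features"),    # bag-of-words term frequencies
    NaiveBayes(featuresCol="features", labelCol="label"),  # multinomial Naive Bayes (the default)
])
model = pipeline.fit(train)

# Predict the disease for a new set of symptoms.
query = spark.createDataFrame([("symptom1 symptom2 symptom3",)], ["symptoms"])
pred = model.transform(query).first()["prediction"]
print(indexer.labels[int(pred)])  # map the indexed prediction back to the disease name
```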

Sentiment Analysis - What does annotating dataset mean?

I'm currently working on my final-year research project, which is an application that analyzes travel reviews found online and, by conducting aspect-level sentiment analysis, gives out a sentiment score for particular tourist attractions.
I have a newly scraped dataset from a famous travel website which does not allow its API to be used for research/academic purposes (bummer).
My supervisor said that I might need to get this dataset annotated before using it for the aforementioned purpose. I am kind of confused as to what data annotation means in this context. Could someone please explain what exactly is happening when a dataset is annotated and how it helps in getting sentiment analysis done?
I was told that I might have to get two/three human annotators and get the data annotated to make it less biased. I'm on a tight schedule and I was wondering if there are any tools that can get it done for me? If so, what will be the impact of using such tools over human annotators? I would also like suggestions for such tools that you would recommend.
I would really appreciate a detailed explanation to my questions, as I am stuck with my project progressing to the next step because of this.
Thank you in advance.
To a first approximation, a machine learning algorithm (e.g., a sentiment analysis algorithm) learns to perform a task that humans currently perform by collecting many examples of humans performing the task and then imitating them. When your supervisor talks about "annotation," they're talking about collecting these examples of a human doing the sentiment annotation task: annotating a sentence for sentiment. That is, collecting pairs of sentences and their sentiment as judged by humans. Without this, there's nothing for the program to learn from, and you're stuck hoping the program can give you something from nothing -- which it never will.
That said, there are tools for collecting this sort of data, or at least helping. Amazon Mechanical Turk and other crowdsourcing platforms are good resources for this sort of data collection. You can also take a look at something like: http://www.crowdflower.com/type-sentiment-analysis.
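To make the idea concrete, here is a tiny sketch (with invented example sentences and labels) of what annotated data looks like and how a basic classifier learns from it using scikit-learn; aspect-level annotation would additionally record which aspect (e.g. "staff", "price") each judgement refers to:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# "Annotation" = humans attaching a sentiment judgement to each review sentence.
sentences = [
    "The castle was breathtaking and well worth the climb.",
    "Queues were endless and the staff were rude.",
    "Lovely gardens, though a little crowded.",
    "Overpriced tickets for a very short tour.",
]
labels = ["positive", "negative", "positive", "negative"]  # human-assigned

# The model imitates the annotators: it learns a mapping from text to label.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(sentences, labels)

print(clf.predict(["The viewpoint was stunning but the queue was awful."]))
```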

How do I adapt my recommendation engine to cold starts?

I am curious what methods/approaches there are to overcome the "cold start" problem: when a new user or a new item enters the system, the lack of information about this new entity makes producing a recommendation difficult.
I can think of doing some prediction-based recommendation (using attributes like gender, nationality and so on).
You can cold start a recommendation system.
There are two types of recommendation systems: collaborative filtering and content-based. Content-based systems use metadata about the things you are recommending; the question then is which metadata is important. The second approach, collaborative filtering, doesn't care about the metadata; it just uses what people did or said about an item to make a recommendation. With collaborative filtering you don't have to worry about which terms in the metadata are important; in fact, you don't need any metadata at all to make the recommendation. The problem with collaborative filtering is that you need data. Before you have enough data you can use content-based recommendations. You can also provide recommendations based on both methods: start with 100% content-based and then, as you get more data, mix in collaborative filtering.
That is the method I have used in the past.
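As an illustration of that mixing strategy (not the answerer's actual code), here is a small sketch that blends a content-based score with a collaborative-filtering score, shifting weight toward collaborative filtering as the user accumulates interactions; the scoring functions themselves are placeholders:

```python
def blended_score(user, item, content_score, cf_score, n_interactions,
                  full_weight_at=50):
    """Blend content-based and collaborative-filtering scores.

    content_score, cf_score: callables returning a score in [0, 1].
    n_interactions: how many ratings/clicks we have for this user.
    full_weight_at: after this many interactions, trust CF fully.
    """
    # Weight on collaborative filtering grows linearly from 0 to 1.
    cf_weight = min(n_interactions / full_weight_at, 1.0)
    return (1.0 - cf_weight) * content_score(user, item) + cf_weight * cf_score(user, item)


# Placeholder scorers for illustration only.
content = lambda u, i: 0.8   # e.g. metadata similarity to the user's stated interests
cf = lambda u, i: 0.3        # e.g. item-item similarity from other users' ratings

print(blended_score("new_user", "item_42", content, cf, n_interactions=0))   # pure content-based
print(blended_score("old_user", "item_42", content, cf, n_interactions=80))  # pure collaborative
```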
Another common technique is to treat the content-based portion as a simple search problem. You just put the metadata in as the text or body of your document, then index your documents. You can do this with Lucene & Solr without writing any code.
If you want to know how basic collaborative filtering works, check out Chapter 2 of "Programming Collective Intelligence" by Toby Segaran
Maybe there are times you just shouldn't make a recommendation? "Insufficient data" should qualify as one of those times.
I just don't see how prediction recommendations based on "gender, nationality and so on" will amount to more than stereotyping.
IIRC, places such as Amazon built up their databases for a while before rolling out recommendations. It's not the kind of thing you want to get wrong; there are lots of stories out there about inappropriate recommendations based on insufficient data.
I'm working on this problem myself, but this paper from Microsoft on Boltzmann machines looks worthwhile: http://research.microsoft.com/pubs/81783/gunawardana09__unified_approac_build_hybrid_recom_system.pdf
This has been asked several times before (naturally, I cannot find those questions now :/), but the general conclusion was that it's better to avoid such recommendations. In various parts of the world the same names belong to different sexes, and so on...
Recommendations based on "similar users liked..." clearly must wait. You can give out coupons or other incentives to survey respondents if you are absolutely committed to doing predictions based on user similarity.
There are two other ways to cold-start a recommendation engine.
Build a model yourself.
Get your suppliers to fill in key information to a skeleton model. (Also may require $ incentives.)
Lots of potential pitfalls in all of these, which are too common sense to mention.
As you might expect, there is no free lunch here. But think about it this way: recommendation engines are not a business plan. They merely enhance the business plan.
There are three things needed to address the Cold-Start Problem:
The data must have been profiled such that you have many different features (with product data, the term used for "feature" is often "classification facet"). If you don't properly profile data as it comes in the door, your recommendation engine will stay "cold", as it has nothing with which to classify recommendations.
MOST IMPORTANT: You need a user-feedback loop with which users can review the personalization engine's suggestions. For example, a Yes/No button for "Was this suggestion helpful?" should queue a review that can move an item from one training dataset (the "Recommend" dataset) to another (the "DO NOT Recommend" dataset); a minimal sketch of this loop follows the list.
The model used for (Recommend/DO NOT Recommend) suggestions should never be considered a one-size-fits-all recommendation. In addition to classifying the product or service to suggest to a customer, how the firm classifies each specific customer matters too. If it is functioning properly, one should expect that customers with different features will get different (Recommend/DO NOT Recommend) suggestions in a given situation. That is the "personalization" part of personalization engines.
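As a rough sketch of the feedback loop described in point 2 (all names and storage choices here are hypothetical, not taken from the answer), the Yes/No signal simply routes each (customer, item) example into the appropriate training set so the next retraining run can learn from it:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FeedbackStore:
    """Routes user feedback on suggestions into two training datasets."""
    recommend: List[Tuple[str, str]] = field(default_factory=list)        # (customer_id, item_id) pairs
    do_not_recommend: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, customer_id: str, item_id: str, was_helpful: bool) -> None:
        # "Was this suggestion helpful?"  Yes -> Recommend set, No -> DO NOT Recommend set.
        target = self.recommend if was_helpful else self.do_not_recommend
        target.append((customer_id, item_id))


store = FeedbackStore()
store.record("customer_17", "item_42", was_helpful=True)
store.record("customer_17", "item_99", was_helpful=False)
# Periodically, both datasets are fed back into retraining the Recommend / DO NOT Recommend model.
print(len(store.recommend), len(store.do_not_recommend))
```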