PredictionIO data importing

I'm considering using PredictionIO for building a music recommendation system.
However, in the user-item interaction, only the following actions are supported: like, dislike, view, conversion, and rate (scale 1 - 5).
My existing data consists only of views (users listen to songs).
How should I translate my data to PredictionIO input? Can I have multiple view records for the same user-item (more views = more weight) or will PredictionIO look at the most recent one based on timestamp?

I only use Mahout, so I don't know how PredictionIO translates your data into input and algorithm choices.
For Mahout you'd use the item-based recommender with boolean input: record each "listen" as an action with a value of 1, and use SIMILARITY_LOGLIKELIHOOD as the similarity metric. LLR ignores the weights anyway. Weights are used in old-style recommenders that try to predict a user's ratings; most people these days are more interested in better ranking, and the config above will give you the best results.
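Just to illustrate why repeated listens add nothing here: log-likelihood ratio similarity is computed from boolean co-occurrence counts, so a user either counts as a listener of a song or not. A rough Python sketch of the idea (the toy events and helper names are made up for the example):

```python
# Sketch of log-likelihood ratio (LLR) item similarity over boolean "listen"
# events, following the usual k11/k12/k21/k22 contingency-table formulation.
from collections import defaultdict
from math import log

def xlogx(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: users who listened to both items, k12/k21: only one, k22: neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# Boolean interactions: each (user, item) pair is counted once, so extra
# listens by the same user change nothing.
events = {("u1", "songA"), ("u1", "songB"), ("u2", "songA"), ("u3", "songB")}
listeners = defaultdict(set)
for user, item in events:
    listeners[item].add(user)

n_users = len({u for u, _ in events})
a, b = listeners["songA"], listeners["songB"]
k11, k12, k21 = len(a & b), len(a - b), len(b - a)
k22 = n_users - len(a | b)
print("LLR(songA, songB) =", llr(k11, k12, k21, k22))
```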

Just for completeness, as answered in the PredictionIO user group:
As of the current version 0.7, the built-in algos won't aggregate multiple view actions, so more view actions do not mean more weight. You may customize the algorithm to handle that.
If you have multiple U2I actions (e.g., view, like, rate), you can define the conflict resolution criteria (e.g., latest, highest). But in your case, with a single action (i.e., user listens to song), multiple "views" will be equivalent to a single "view".
The next version of PredictionIO will have much better support for custom algorithms and engines (e.g., music recommendation).


Using Google Natural Language API or AutoML for sentiment detection of a specific condition

What we would like to do is analyze conversations and detect when there is negative sentiment. By this I mean we specifically want to detect whether the user on the call is angry, frustrated, or combative and needs to be transferred. We had planned to use the Natural Language sentiment analysis, but the problem is that it only detects whether a statement is positive or negative. For example:
I am unable to login because it said my password is expired.
This would result in a negative sentiment, but the user is simply stating a fact; it is not an indication that the user is combative.
I could perform some sort of entity analysis, and it would return a list of predefined entity types like "Person". However, it does not appear to let me create new entity types, nor can I adjust the criteria for an entity type.
Is my best bet to look into AutoML? With it I would have more flexibility, but what would be the cost difference between using the Natural Language API and the AutoML API?
Thanks.
The models used in the Google Natural Language API have been trained on enormously large document corpora, so their performance is usually quite good as long as they are applied to datasets that do not use very idiosyncratic language.
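For reference, calling the pretrained sentiment model looks roughly like this (a sketch assuming the google-cloud-language Python client; check the current client docs for the exact request shape):

```python
# Sketch of a document-level sentiment call against the pretrained
# Natural Language API model (assumes the google-cloud-language client
# is installed and credentials are configured).
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
text = "I am unable to login because it said my password is expired."
document = language_v1.Document(
    content=text, type_=language_v1.Document.Type.PLAIN_TEXT
)
response = client.analyze_sentiment(request={"document": document})
sentiment = response.document_sentiment
# score runs roughly from -1 (negative) to 1 (positive); magnitude is strength.
print("score:", sentiment.score, "magnitude:", sentiment.magnitude)
```

As the question points out, the score only tells you whether the wording is negative, not whether the speaker is actually angry or combative.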
AutoML, on the other hand, has a fairly slow training process and offers several different model types [1]. The AutoML sentiment analysis model might be very convenient, but for performance-critical tasks it makes sense to invest the time and develop the model yourself to get better results. For AutoML pricing, you can use the link [2] below to estimate the cost.
[1]https://cloud.google.com/natural-language/automl/docs/features
[2]https://cloud.google.com/vision/automl/pricing

What is currently the best way to add a custom dictionary to a neural machine translator that uses the transformer architecture?

It's common to add a custom dictionary to a machine translator to ensure that terminology from a specific domain is correctly translated. For example, the term server should be translated differently when the document is about data centers, vs when the document is about restaurants.
With a transformer model, this is not very obvious to do, since words are not aligned 1:1. I've seen a couple of papers on this topic, but I'm not sure which would be the best one to use. What are the best practices for this problem?
I am afraid you cannot easily do that. You cannot easily add new words to the vocabulary because you don't know what embeddings they would get; they were never seen during training. You can try to remove some words, or alternatively you can manually change the bias in the final softmax layer to prevent some words from appearing in the translation. Anything else would be pretty difficult to do.
What you want to do is called domain adaptation. To get an idea of how domain adaptation is usually done, you can have a look at a survey paper.
The most commonly used approaches are probably model fine-tuning or ensembling with a language model. If you have parallel data in your domain, you can try fine-tuning your model on it (with plain SGD and a small learning rate).
If you only have monolingual data in the target language, you can train a language model on that data. During decoding, you can then mix the probabilities from the domain-specific language model and the translation model. Unfortunately, I don't know of any tool that does this out of the box.
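To make the mixing step concrete, here is a rough sketch of what it (often called shallow fusion) looks like at each decoding step; the `translation_model` and `domain_lm` objects and their methods are hypothetical placeholders, not any particular toolkit's API:

```python
# Sketch of shallow fusion: combine translation-model and in-domain
# language-model log-probabilities at each decoding step.
import numpy as np

def fused_next_token_scores(translation_model, domain_lm,
                            source, prefix, lm_weight=0.2):
    # Log-probabilities over the target vocabulary under each model.
    tm_logp = translation_model.next_token_logprobs(source, prefix)
    lm_logp = domain_lm.next_token_logprobs(prefix)
    # Weighted sum in log space; lm_weight is tuned on a dev set.
    return tm_logp + lm_weight * lm_logp

def greedy_decode(translation_model, domain_lm, source, eos_id, max_len=100):
    prefix = []
    for _ in range(max_len):
        scores = fused_next_token_scores(translation_model, domain_lm,
                                         source, prefix)
        token = int(np.argmax(scores))
        if token == eos_id:
            break
        prefix.append(token)
    return prefix
```

In practice you would do the same thing inside a beam search rather than greedy decoding.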

What are the non-rating based recommendation systems

What are the non-rating based recommendation systems?
I could have used Recommenderlab, but it needs a rating matrix as input.
If you are looking for recommender system packages that can handle purchase data, I would recommend taking a look at https://github.com/benfred/implicit or https://github.com/lyst/lightfm. Both packages can handle data sets with no explicit rating information (such as 5-star movie ratings). Just google "implicit feedback recommender systems", which is probably what you want.
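For a sense of what that looks like with the implicit package, here is a minimal sketch on toy purchase counts (the exact fit/recommend signatures have changed between versions of the package, so check its docs):

```python
# Sketch of implicit-feedback ALS on purchase/listen counts using the
# `implicit` package; no explicit ratings are needed.
import implicit
import scipy.sparse as sp

# rows = users, cols = items, values = interaction counts (not ratings)
user_items = sp.csr_matrix(
    [[2.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [1.0, 1.0, 0.0]]
)

model = implicit.als.AlternatingLeastSquares(factors=2, iterations=10)
model.fit(user_items)

# Top-2 recommendations for user 0 (newer versions return ids and scores).
ids, scores = model.recommend(0, user_items[0], N=2,
                              filter_already_liked_items=False)
print(ids, scores)
```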
I think your data set is a boolean data set, i.e., (userId, itemId) pairs where no rating exists.
Here is a solution with Apache Mahout for boolean data sets:
http://kickstarthadoop.blogspot.com.tr/2011/05/generating-recommendations-with-mahout_26.html

Are Operational Transformation Frameworks only meant for text?

Looking at all the examples of Operational Transformation frameworks out there, they all seem to revolve around transforming changes to plain-text documents. How would an OT framework be used for more complex objects?
I want to develop a real-time sticky-notes-style app, where people can co-create sticky notes and change their position and text value. Would I be right in assuming that the position values wouldn't be transformed? (I mean, how would they be? You can't merge them, right?) However, I would want to use an OT framework to resolve conflicts on the sticky notes' text value, correct?
I do not see any problem using Operational Transformation with complex objects; what you need is to define which operations your OT system supports and how concurrency is resolved for them.
For instance, if you receive two sticky-note "move coordinates" operations from two different users based on the same client state, you need to make both states converge, probably by cancelling out the second operation.
This is exactly the same behaviour as with text: when two users generate two updates that delete text ranges overlapping completely (or partially), the second update processed must be transformed against the previous one, and the resulting operation will only effectively delete a portion of the original range (or be completely cancelled with a 'no-op').
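As an illustration, a transform step for a hypothetical sticky-note "move" operation could look like this (a sketch; the operation model and names are made up for the example, and text edits would go through ordinary string OT instead):

```python
# Sketch of transforming a concurrent "move" operation on a sticky note.
# Concurrent moves of the same note can't be merged meaningfully, so the
# second one processed is cancelled (turned into a no-op) to converge.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MoveOp:
    note_id: str
    x: int
    y: int

def transform_move(incoming: MoveOp, applied: MoveOp) -> Optional[MoveOp]:
    """Transform `incoming` against a concurrent, already-applied op."""
    if incoming.note_id == applied.note_id:
        # Both users moved the same note from the same client state:
        # keep the first move, drop the second so both sides converge.
        return None  # no-op
    # Moves of different notes are independent; apply unchanged.
    return incoming

# Example: two concurrent moves of note "n1" from the same state.
first = MoveOp("n1", x=10, y=20)
second = MoveOp("n1", x=300, y=40)
print(transform_move(second, applied=first))  # -> None (cancelled)
```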
You can take a look at this nice explanation of how Google Wave's Operational Transformation works and extrapolate from there how your own implementation should behave.
See the following paper for an approach to using OT with trees if you want to go down that route:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.74
However, in your particular case, I would use a separate plain-text OT document for each sticky note and use an existing library, e.g. Etherpad, to do the heavy lifting. The positions of the notes could then be broadcast on a last-committer-wins basis.
Operational Transformation is a general technique; it works for any data type. The point is that you need to define your transformation functions. Also, there are some atomic attributes that you cannot merge automatically (like position and background color); those will mostly be "last update wins", or the user resolves them manually when there is a conflict.
There are some nice libraries and frameworks that already provide OT for complex data:
ShareJS : library for Node which provides all operations on JSON objects
DerbyJS: framework for NodeJS, it uses ShareJS for OT stuff.
Open Coweb framework : Dojo foundation project for cooperative web applications using OT

Tools for getting intent from Twitter statuses? [closed]

I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant Tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there's some tool I can use to at least narrow it down a bit?
Alternatively, I could just use hashtags. But that requires more work on behalf of the users. I'm not super familiar with Twitter - do most people use hashtags (even for smaller scale issues), or would relying on them cut off a large segment of data?
I'd also be interested in grabbing Facebook statuses (with permission from the poster, of course), and hashtag use is pretty rare on Facebook.
I could use simple keyword search to crudely narrow the field, but that's more likely to require human intervention to determine which tweets should actually be posted alongside the content.
Ideas? Has this been done before?
There are two straightforward ways to go about finding tweets relevant to your content.
The first would be to treat this as a supervised document classification task, whereby you would train a classifier to annotate tweets with a certain predetermined set of topic labels. You could then use the labels to select tweets that are appropriate for whatever content you'll be augmenting. If you don't like using a predetermined set of topics, another approach would be to simply score tweets according to their semantic overlap with your content. You could then display the top n tweets with the most semantic overlap.
Supervised Document Classification
Using supervised document classification would require that you have a training set of tweets labeled with the set of topics you'll be using. e.g.,
tweet: NBA finals rocked label: sports
tweet: Googlers now allowed to use Ruby! label: programming
tweet: eating lunch label: other
If you want to collect training data without having to manually label the tweets with topics, you could use hashtags to assign topic labels to the tweets. The hashtags could be identical with the topic labels, or you could write rules to map tweets with certain hashtags to the desired label. For example, tweets tagged either #NFL or #NBA could all be assigned a label of sports.
Once you have the tweets labeled by topic, you can use any number of existing software packages to train a classifier that assigns labels to new tweets. A few available packages include:
NLTK (Python) - see Chapter 6 in the NLTK book on Learning to Classify Text
Classifier4J (Java)
nBayes (C#)
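With NLTK, for instance, the training step could look roughly like this (toy tweets and bag-of-words presence features, in the spirit of Chapter 6 of the NLTK book):

```python
# Sketch of training a topic classifier on hashtag-labeled tweets with NLTK.
import nltk

train_tweets = [
    ("NBA finals rocked #NBA", "sports"),
    ("Googlers now allowed to use Ruby!", "programming"),
    ("eating lunch", "other"),
]

def features(tweet):
    # Simple word-presence features; real data would need more preprocessing.
    return {f"contains({w.lower()})": True for w in tweet.split()}

train_set = [(features(text), label) for text, label in train_tweets]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("Lakers win the finals")))
```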
Semantic Overlap
Finding tweets using their semantic overlap with your content avoids the need for a labeled training set. The simplest way to estimate the semantic overlap between your content and the tweets that you're scoring is to use a vector space model. To do this, represent your document and each tweet as a vector with each dimension in the vector corresponding to a word. The value assigned to each vector position then represents how important that word is to the meaning of the document. One way to estimate this would be to simply use the number of times the word occurs in the document. However, you'll likely get better results by using something like TF/IDF, which up-weights rare terms and down-weights more common ones.
Once you've represented your content and the tweets as vectors, you can score the tweets by their semantic similarity to your content by taking the cosine similarity of the vector for your content and the vector for each tweet.
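In Python this can be sketched with scikit-learn (used here just as one option; the texts are toy examples):

```python
# Sketch of scoring tweets against an article by TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = "The NBA finals went to seven games this year."
tweets = [
    "NBA finals rocked",
    "eating lunch",
    "Googlers now allowed to use Ruby!",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([article] + tweets)

# First row is the article; the remaining rows are the tweets.
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for score, tweet in sorted(zip(scores, tweets), reverse=True):
    print(f"{score:.3f}  {tweet}")
```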
There's no need to code any of this yourself. You can just use a package like Classifier4J, which includes a VectorClassifier class that scores document similarity using a vector space model.
Better Semantic Overlap
One problem you might run into with vector space models that use one term per dimension is that they don't do a good job of handling different words that mean roughly the same thing. For example, such a model would say that there is no similarity between The small automobile and A little car.
There are more sophisticated modeling frameworks such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) that can be used to construct more abstract representations of the documents being compared to each other. Such models can be thought of as scoring documents not based on simple word overlap, but rather in terms of overlap in the underlying meaning of the words.
In terms of software, the package Semantic Vectors provides a scalable LSA-like framework for document similarity. For LDA, you could use David Blei's implementation or the Stanford Topic Modeling Toolbox.
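For a quick sense of the LSA-style approach, truncated SVD over a TF-IDF matrix gives a minimal version of the idea (a sketch with scikit-learn; Semantic Vectors and the LDA toolkits above are the more complete options):

```python
# Sketch of an LSA-style comparison: reduce TF-IDF vectors with truncated SVD
# so documents can match on latent dimensions rather than exact word overlap.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = [
    "The small automobile",
    "A little car",
    "eating lunch downtown",
    "NBA finals rocked",
]

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2),
                    Normalizer())
vectors = lsa.fit_transform(docs)

# With a realistic amount of training text, related phrasings like the first
# two documents end up close together in the reduced space.
print(cosine_similarity(vectors[:1], vectors[1:]))
```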
Great question. I think for Twitter your best bet is to use hashtags, because otherwise you need to create or find existing algorithms that do language analysis and improve over time based on user input/feedback.
For Facebook you can do something like what Bing implemented a while back, as I covered in this article:
http://www.socialtimes.com/2010/06/bing-adds-facebook-and-twitter-features-steps-up-social-services/
I wrote: For example, a search for “NBA Finals” will return fan-page content from Facebook, including posts from a local TV station. So if you're trying to augment NBA-related content, you could do a similar search to the one Bing provides - searching publicly available fan-page content the way spiders index it for search engines. I'm not a developer, so I'm not sure of the intricacies, but I know it can be done.
You can also display popular shared links from users who are publishing to ‘everyone’; these are aggregated across all non-fan-page content. I'm not sure whether this is limited to content published to 'everyone' and/or content that is 'popular', although I would assume so - but you can double-check that.
Hope this helps
The problem with NLP is not the algorithms (although that is an issue); the problem is the resources. There are some open-source shallow parsing tools (that's all you would need to get intent) that you could use, but parsing thousands or millions of tweets would cost a fortune in computer time.
On the other hand, as you said, not all tweets have hashtags, and there is no guarantee they will be relevant.
Maybe you can use keyword search to filter down to a few possibilities (those with the highest keyword density) and then use deeper analysis to pick the top 1 or 2. This would keep the computing cost to a minimum, and you should be able to get relevant tweets.
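A rough sketch of that two-stage idea (the keywords, threshold, and the deep_score stub are all made up for the example):

```python
# Sketch of a two-stage filter: a cheap keyword-density pass first,
# then a placeholder "deeper" scorer on the few survivors.
def keyword_density(tweet, keywords):
    words = [w.strip("#.,!?").lower() for w in tweet.split()]
    if not words:
        return 0.0
    return sum(1 for w in words if w in keywords) / len(words)

def deep_score(tweet):
    # Stand-in for the expensive analysis (shallow parsing, a classifier, etc.).
    return len(set(tweet.lower().split()))

keywords = {"nba", "finals", "playoffs"}
tweets = ["NBA finals rocked!", "eating lunch", "Watching the #NBA playoffs"]

candidates = [t for t in tweets if keyword_density(t, keywords) > 0.2]
top = sorted(candidates, key=deep_score, reverse=True)[:2]
print(top)
```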