collaborative filtering with implicit feedback , How to set preferences? - recommendation-engine

I have a dataset with only two fields itemId, productid, i would like to try mahout ALS or mllib for implicit feedback, is the best approach to create the preference column in the dataset with all 1's? reading koren paper (Collaborative Filtering for Implicit Feedback Datasets) i see that all confidence interval would be the same with same preferences, it is ok? thanks!

If You plan to try ALS for Your recommendation I will encourage You to try directly mllib/ML from Spark. It has blocked implementation of ALS - it's really fast. Just when creating RDD[Rating] give for the rating value 1.0. Remember to learn ALS with parameter say You gave implicit feedback (implicitPrefs = True).
You can check example here:
https://github.com/apache/spark/blob/v1.3.1/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L113
And if You are brave enough to use new ML package with DataFrames:
https://github.com/apache/spark/blob/v1.3.1/examples/src/main/scala/org/apache/spark/examples/ml/MovieLensALS.scala
Probably it's better for experimentation with parameters with new pipeline.
But I find it difficult to work with my code were I have to access factorized matrices from the learned model.
Good luck and have fun!

Related

How to implement Featuretools into my ML Process?

I am exploring the possibility of implementing Featuretools into my pipeline, to be able to create new features from my Df.
Currently I am using a GridSearchCV, with a Pipeline embedded inside it. Since Featuretools is creating new features with aggregation on columns, like STD(column) etc, I feel like it is suspectible to data leakage. In their FAQ, they are giving an example approach to tackle it, which is not suitable for a Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline but it seems like not compatible with Pipelines. It would use fold train data to construct features, transform fold test data. K times. At the end, it would use whole data to construct, during Refit= True stage of GridSearchCV. If you have any example opposed to this fact, you are very welcome.
Idea 1: I can switch to a manual CV structure, not embedded into pipeline. And inside it, I can use Train data to construct new features, and test data to transform with these. It will work K times. At the end, all data can be used to construct Ultimate model.
It is the safest option, with time and complexity disadvantages.
Idea 2: Using it with whole data, ignore the leakage possibility. I am not in favor of this of course. But when I look at Project Github page, all the examples are combining Train and Test data, creating these features with whole data. Then go on with Train-Test division for modeling.
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb
Actually if the developers of the project think like that, I could give it a chance with whole data.
What do you think, I would love to hear about your experiences on FeatureTools.

Convert PySpark ML Word2Vec model to Gensim Word2Vec model

I've generated a PySpark Word2Vec model like so:
from pyspark.ml.feature import Word2Vec
w2v = Word2Vec(vectorSize=100, minCount=1, inputCol='words', outputCol = 'vector')
model = w2v.fit(df)
(The data that I used to train the model on isn't relevant, what's important is that its all in the right format and successfully yields a pyspark.ml.feature.Word2VecModel object.)
Now I need to convert this model to a Gensim Word2Vec model. How would I go about this?
If you still have the training data, re-training the gensim Word2Vec model may be the most straightforward approach.
If you only need the word-vectors, perhaps PySpark's model can export them in the word2vec.c format that gensim can load with .load_word2vec_format().
The only reason to port the model would be to continue training. Such incremental training, while possible, involves considering a lot of tradeoffs in balancing the influence of the older and later training to get good results.
If you are in fact wanting to do this conversion in order to do more training in such a manner, it again suggests that using the original training to reproduce a similar model could be plausible.
But, if you have to convert the model, the general approach would be to study the source code and internal data structures of the two models, to discover how they alternatively represent each of the key aspects of the model:
the known word-vectors (model.wv.vectors in gensim)
the known-vocabulary of words, including stats about word-frequencies and the position of individual words (model.wv.vocab in gensim)
the hidden-to-output weights of the model (`model.trainables' and its properties in gensim)
other model properties describing the model's modes & metaparameters
A reasonable interactive approach could be:
Write some acceptance tests that take models of both types, and test whether they are truly 'equivalent' for your purposes. (This is relatively easy for just checking if the vectors for individual words are present and identical, but nearly as hard as the conversion itself for verifying other ready-to-be-trained-more behaviors.)
Then, in an interactive notebook, load the source model, and also create a dummy gensim model with the same vocabulary size. Consulting the source code, write Python statements to iteratively copy/transform key properties over from the source into the target, repeatedly testing if they verify as equivalent.
When they do, take those steps you did manually and combine them into a utility method to do the conversion. Again verify its operation then try using the converted model however you'd hoped – perhaps discovering overlooked info or discovering other bugs in the process, and then improving the verification method and conversion method.
It's possible that the PySpark model will be missing things the gensim model expects, which might require synthesizing workable replacement values.
Good luck! (But re-train the gensim model from the original data if you want things to just be straightforward and work.)

DeepLearning4J - Acquiring Data and Train Model

I try to create the easiest of a NeuralNetwork and training it with some data:
Therefore I created a test.csv with a the following pattern:
number,number+1;
number2,number2+1
...
I try to make a linear regression with the network...
But I do not find a way to acquire the data, DataSetIterator does not work.
How to fit the Data, how to test the Data?
In our examples, we encourage people to use datavec + recordreaderdatasetiterator.
Datavec has all of the various data loading components.
I'm not sure what you mean about "datasetiterator not working" wihtout seeing any code, but it seems like you didn't really look at our examples.
In there are multiple examples of a csv record reader you can use for both regression and classification use cases.
Consider reorienting your data pipeline to use those.
Those examples are always found here:
https://github.com/deeplearning4j/dl4j-examples
If you follow any of those, the same pattern emerges:
Record reader for whatever data format -> RecordReaderDataSetIterator
The iterator allows you to specify common constructors such as whether it is a regression or not, which column your label is etc.

usage of naive bayes Model for prediction

Hi all I am new to scala and spark MLIB.
I have a dataset of diseses of diseases along with the symptoms which are in the following format:
Disease,symptom1 symptom2 symptom3
I have almost 300 entries which are in the above mentioned format in a CSV file.
I want to achieve this following functionality:
If a user has given a input of sysmptoms namely Symptom1,Symptom2,Symptom3 the model must be able to predict the disease.
I have the following Questions:
which machine learning model should I use to achieve this functionality.
I have gone through some models and founf NAIVES Bayes model if wrong correct me.
can I provide text input to Naives Bayes model.
Is there any sample code available to achieve this functionality.
You can use any of the classification algorithms present in Spark MLlib for further reference read the official docs and go thru this link from databricks blog https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html

Sentiment Analysis - What does annotating dataset mean?

I'm currently working on my final year research project, which is an application which analyzes travel reviews found online, and give out a sentiment score for particular tourist attractions as a result, by conducting aspect level sentiment analysis.
I have a newly scraped dataset from a famous travel website which does not allow to use their API for research/academic purposes. (bummer)
My supervisor said that I might need to get this dataset annotated before using it for the aforementioned purpose. I am kind of confused as to what data annotation means in this context. Could someone please explain what exactly is happening when a dataset is annotated and how it helps in getting sentiment analysis done?
I was told that I might have to get two/three human annotators and get the data annotated to make it less biased. I'm on a tight schedule and I was wondering if there are any tools that can get it done for me? If so, what will be the impact of using such tools over human annotators? I would also like suggestions for such tools that you would recommend.
I would really appreciate a detailed explanation to my questions, as I am stuck with my project progressing to the next step because of this.
Thank you in advance.
To a first approximation, machine learning algorithms (e.g., a sentiment analysis algorithm) is learning to perform a task that humans currently perform by collecting many examples of the human performing the task, and then imitating them. When your supervisor talks about "annotation," they're talking about collecting these examples of a human doing the sentiment annotation task: annotating a sentence for sentiment. That is, collecting pairs of sentences and their sentiment as judged by humans. Without this, there's nothing for the program to learn from, and you're stuck hoping the program can give you something from nothing -- which it never will.
That said, there are tools for collecting this sort of data, or at least helping. Amazon Mechanical Turk and other crowdsourcing platforms are good resources for this sort of data collection. You can also take a look at something like: http://www.crowdflower.com/type-sentiment-analysis.