stereotype user model implementation in Apache Mahout - recommendation-engine

I am working on an implementation of both individual and stereotype user models in a recommendation system. I came across Apache Mahout, but it seems that it only works with the individual user model.
My question is: how can I work with a stereotype user model in Apache Mahout Taste?
My understanding of a recommendation engine is
that it has these core parameters:
Method of information acquisition (implicit or explicit)
User model (individual or stereotype)
Recommendation technique (collaborative or content-based)

Taste is being deprecated. Mahout has undergone a major reboot and no longer accepts Hadoop MapReduce code. Many of the Hadoop MapReduce algorithms have been rewritten on the Mahout Samsara codebase that virtualizes a great deal of linear algebra type operations to run on multiple compute engines. The most complete is Spark, which runs something like 10x faster than Hadoop MapReduce.
With that as a preamble: the new "recommender" implementations, although they include ALS, also have code for item and row similarity, which in recommender data means item and user similarity.
See the description of "spark-rowsimilarity" here: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html#2-spark-rowsimilarity
The example there targets a different use case, but the job works just as well for computing user similarities if you input user interaction vectors.
Another way to do this is to put user interaction vectors into a Lucene-based similarity engine such as Solr or Elasticsearch. Then query with a specific user's data and you will get back similar users.
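To make the "user interaction vectors in, similar users out" idea concrete, here is a minimal in-memory sketch. It does not use Mahout, Solr or Elasticsearch and does not reproduce Mahout's own scoring; plain cosine similarity is swapped in for simplicity, and the user names, item names and function names are all hypothetical. Groups of mutually similar users found this way could serve as a starting point for stereotypes.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_users(target, interactions, top_n=3):
    """Rank all other users by similarity to `target`, most similar first."""
    scores = [
        (other, cosine(interactions[target], vec))
        for other, vec in interactions.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical interaction data: user -> {item: interaction strength}
interactions = {
    "alice": {"item1": 1, "item2": 1},
    "bob": {"item1": 1, "item2": 1, "item3": 1},
    "carol": {"item4": 1},
}

ranked = similar_users("alice", interactions)
print(ranked)  # bob ranks first, carol has no overlap with alice
```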

Related

Design of real time sentiment analysis

We're trying to design a real time sentiment analysis system (on paper) for a school project. We got some (very vague) negative feedback on how we store our data, but it isn't fully clear why this would be a bad idea or how we'd best improve this.
The setup of the system is as follows:
Data from real-time news RSS feeds is gathered in a Kafka message queue, which connects to our preprocessing platform. This preprocessing step transforms all the news articles into semi-structured data, on which we can run a sentiment analysis.
We then want to store both the sentiment analysis and the preprocessed, semi-structured news article for reference.
We were thinking of using MongoDB as the database, since it gives you a lot of freedom in defining different fields in the value (of the key:value pair you store), rather than Cassandra (which would be faster).
The basic use case is for people to look up an institution and get the sentiment analysis of a bunch of news articles in a certain timeframe.
As a possible improvement: do we need to use a NoSQL database, or would it make sense to use a SQL database? I think our system could benefit from being denormalized (as is the case by default in NoSQL), and we wouldn't need operations such as joins, which are significantly faster in SQL systems.
Does anyone know of existing systems that do similar things, for comparison?
Any input would be highly appreciated.
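To make the basic use case concrete, here is a rough sketch of what one stored record and the lookup could look like. It uses plain Python lists and dicts rather than any database, and the field names (`institution`, `published`, `sentiment`, `title`) and sample values are assumptions for illustration only.

```python
from datetime import datetime

# Hypothetical denormalized records: one document per preprocessed article,
# with the sentiment score stored alongside the article fields.
articles = [
    {"institution": "Bank A", "published": datetime(2020, 1, 5),
     "sentiment": 0.4, "title": "Bank A posts record profit"},
    {"institution": "Bank A", "published": datetime(2020, 1, 20),
     "sentiment": -0.2, "title": "Bank A faces inquiry"},
    {"institution": "Bank A", "published": datetime(2020, 3, 1),
     "sentiment": 0.9, "title": "Bank A cleared"},
    {"institution": "Bank B", "published": datetime(2020, 1, 10),
     "sentiment": 0.1, "title": "Bank B opens branch"},
]


def sentiment_for(articles, institution, start, end):
    """Return matching articles and their mean sentiment for one
    institution within the half-open time range [start, end)."""
    hits = [
        a for a in articles
        if a["institution"] == institution and start <= a["published"] < end
    ]
    mean = sum(a["sentiment"] for a in hits) / len(hits) if hits else None
    return hits, mean


hits, mean = sentiment_for(
    articles, "Bank A", datetime(2020, 1, 1), datetime(2020, 2, 1)
)
print(len(hits), mean)
```

Whatever store is chosen, the lookup is a filter on institution plus a time range, so that pair is probably what the key or index should be built around.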

Neptune-Gremlin-Python | Best practises for scaling network analysis and serving use cases like recommendations in realtime

I have a generic question around the best practises on usage of Neptune DB as a network database and its ability to scale up for complex computing. I want to develop a user recommendation system where incoming users on the platform are prompted with other users they are likely to follow, in order to grow the network.
For implementing a simple technique like triadic closure, should I use Gremlin queries on the network DB (AWS Neptune in my case) to generate the recommendations? I believe in this case I would have to create Python scripts that parallelise queries for multiple nodes and generate recommendations for each node at scale.
Or is it a more common practise to store the network data in the form of nodes, edges and their properties in a relational database, run SQL queries to load the network data into Python, and then use packages like NetworkX on top of that? In this case I wouldn't have to worry about batch computations, since a relational database like Redshift would take care of them. However, I would be writing Python logic to implement techniques such as triadic closure.
Additionally, in the future I may want to use more complex graph computation techniques like graph clustering, partitioning, and calculation of different kinds of centralities. Are all or any of these possible within the framework of Neptune + Gremlin?
With the above context below are the questions I am seeking answers for:
What is the commonly used tech stack for a data science team working with graph data to build solutions such as user recommendations? By data-science tech stack I mean technologies that help query, analyse, visualise, compute and serve.
Can Neptune + Gremlin replace python packages such as NetworkX for network analysis and centrality measurement?
Is Neptune DB ideal only as a data store OR can it also support complex network analysis and recommendation serving?
Any insight/resources on this would be really helpful!
It is definitely possible to do triadic closure in Gremlin. I have also seen data scientists use both NetworkX and Gremlin together by running the gremlin-python client in a Jupyter Notebook. As this question is quite specific to Amazon Neptune you may want to post to the Neptune support forum at [1]. There are also some useful Gremlin Recipes at [2]
If you post to the support forum I am sure someone will respond.
[1] https://forums.aws.amazon.com/forum.jspa?forumID=253&start=0
[2] http://tinkerpop.apache.org/docs/current/recipes/
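For reference, the triadic closure idea itself is small. The sketch below is plain in-memory Python, not Gremlin and not a Neptune API; the graph data and the function name are hypothetical. It ranks accounts a user does not follow yet by how many of the accounts they already follow also follow them.

```python
from collections import Counter


def recommend_by_triadic_closure(follows, user, top_n=5):
    """Suggest accounts `user` does not follow yet, ranked by how many
    of the accounts `user` already follows also follow them."""
    already = follows.get(user, set())
    counts = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            # Skip the user themselves and anyone they already follow.
            if candidate != user and candidate not in already:
                counts[candidate] += 1
    return counts.most_common(top_n)


# Hypothetical follow graph: user -> set of users they follow
follows = {
    "a": {"b", "c"},
    "b": {"d", "e"},
    "c": {"d"},
    "d": set(),
    "e": set(),
}

suggestions = recommend_by_triadic_closure(follows, "a")
print(suggestions)  # [('d', 2), ('e', 1)]
```

The same two-hop traversal is what a Gremlin query would express on the database side; which side to run it on depends mostly on graph size and how fresh the recommendations need to be.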

Raw tf-idf on elasticsearch

I'm looking for a really scalable method to do tf-idf searching on a large dataset, so Elasticsearch naturally came to mind. Upon reading about the default methods it uses for scoring, it seems that none actually does raw tf-idf. The closest is the "Practical Scoring Function", but this combines a query norm, coordination factor, etc. I've attached Lucene's formula below.
Is there any way to get Elasticsearch to return a raw tf-idf score without any of the additional fluff? I've tested each of the built-in implementations and none works as well as plain tf-idf.
Raw tf-idf in this application would just be -
Also, I'm using one of the AWS-provisioned Elasticsearch instances, so I don't have access to the internals of the Java code, only the REST API.
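To pin down what "raw tf-idf" means here without the missing formula, the sketch below uses one common textbook definition: raw term count times log(N / df), summed over the query terms. That definition is an assumption, not necessarily the exact formula from the post, and this is plain Python over a toy corpus, not an Elasticsearch call.

```python
import math
from collections import Counter


def raw_tf_idf(query_terms, docs):
    """Score each document as the sum over query terms of tf * idf,
    with tf = raw term count and idf = log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query_terms
            if df[term] > 0  # terms absent from the corpus contribute nothing
        )
        scores.append(score)
    return scores


# Hypothetical toy corpus
docs = ["the cat sat", "the dog sat down", "the cat chased the cat"]
scores = raw_tf_idf(["cat"], docs)
print(scores)
```

There is no length normalization, query norm or coordination factor here, which is the main difference from the Practical Scoring Function mentioned above.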

Item recommendation service

I'm supposed to build a book recommendation service using MyMediaLite. So far I have collected books from a website using the Nutch crawler, and I am storing the info in HBase. The problem is that I don't fully understand how this all works. Going by the examples, I have to pass test data and training data files containing user-item ID pairs and ratings. But what about other information about a book, like categories and authors? How is it possible to find "similar" books from their information, without any information about users (so far)? Is it possible to pass data directly from HBase, without storing it to a file and then loading it in?
Or would Apache Mahout or LibRec be better suited to this job?
User-item-rating information, often in a matrix, is the basis for collaborative filtering algorithms (user-user CF, item-item CF, matrix factorization, and others). You're using other people's opinions to form recommendations. There's no innate recognition of the content of the items themselves. For that, you'll need some sort of content-based filtering algorithm or data mining technique. These are often used in the "user cold start" scenario you described: you have lots of information about items but not about a particular user's preferences.
First, think about your end goal and the data you have. Based on your product needs and available data, you can choose the right algorithm for your purposes. I highly recommend the RecSys course on Coursera for learning more: https://www.coursera.org/learn/recommender-systems. It's taught by a leader in the field.
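As a rough illustration of the content-based side, the sketch below finds "similar" books from metadata alone, with no user data. It is plain Python, not MyMediaLite, Mahout or LibRec; Jaccard overlap and the equal 0.5/0.5 weighting of authors and categories are arbitrary choices for the example, and the book records are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_books(book_id, books, top_n=3):
    """Rank other books by overlap in authors and categories."""
    target = books[book_id]
    scored = []
    for other_id, other in books.items():
        if other_id == book_id:
            continue
        # Equal weighting of the two attributes is an arbitrary assumption.
        score = (
            0.5 * jaccard(target["authors"], other["authors"])
            + 0.5 * jaccard(target["categories"], other["categories"])
        )
        scored.append((other_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical book metadata, e.g. as crawled and stored per book
books = {
    "b1": {"authors": ["Asimov"], "categories": ["sci-fi", "classic"]},
    "b2": {"authors": ["Asimov"], "categories": ["sci-fi"]},
    "b3": {"authors": ["Austen"], "categories": ["romance", "classic"]},
}

result = similar_books("b1", books)
print(result)  # b2 first (same author, shared category), then b3
```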

NoSQL for time series/logged instrument reading data that is also versioned

My Data
It's primarily monitoring data, passed in the form of Timestamp: Value, for each monitored value, on each monitored appliance. It's regularly collected over many appliances and many monitored values.
Additionally, it has the quirky feature of many of these data values being derived at the source, with the calculation changing from time to time. This means that my data is effectively versioned, and I need to be able to simply call up only data from the most recent version of the calculation. Note: This is not versioning where the old values are overwritten. I simply have timestamp cutoffs, beyond which the data changes its meaning.
My Usage
Downstream, I'm going to have various undefined data mining/machine learning uses for the data. It's not really clear yet what those uses are, but it is clear that I will be writing all of the downstream code in Python. Also, we are a very small shop, so I can really only deal with so much complexity in setup, maintenance, and interfacing to downstream applications. We just don't have that many people.
The Choice
I am not allowed to use a SQL RDBMS to store this data, so I have to find the right NoSQL solution. Here's what I've found so far:
Cassandra
Looks totally fine to me, but it seems like some of the major users have moved on. It makes me wonder if it's just not going to be that much of a vibrant ecosystem. This SE post seems to have good things to say: Cassandra time series data
Accumulo
Again, this seems fine, but I'm concerned that this is not a major, actively developed platform. It seems like this would leave me a bit starved for tools and documentation.
MongoDB
I have a, perhaps irrational, intense dislike for the Mongo crowd, and I'm looking for any reason to discard this as a solution. It seems to me like the data model of Mongo is all wrong for things with such a static, regular structure. My data even comes in (and has to stay in) order. That said, everybody and their mother seems to love this thing, so I'm really trying to evaluate its applicability. See this and many other SE posts: What NoSQL DB to use for sparse Time Series like data?
HBase
This is where I'm currently leaning. It seems like the successor to Cassandra with a totally usable approach for my problem. That said, it is a big piece of technology, and I'm concerned about really knowing what it is I'm signing up for, if I choose it.
OpenTSDB
This is basically a time-series specific database, built on top of HBase. Perfect, right? I don't know. I'm trying to figure out what another layer of abstraction buys me.
My Criteria
Open source
Works well with Python
Appropriate for a small team
Very well documented
Has specific features to take advantage of ordered time series data
Helps me solve some of my versioned data problems
So, which NoSQL database actually can help me address my needs? It can be anything, from my list or not. I'm just trying to understand what platform actually has code, not just usage patterns, that support my super specific, well understood needs. I'm not asking which one is best or which one is cooler. I'm trying to understand which technology can most natively store and manipulate this type of data.
Any thoughts?
It sounds like you are describing one of the most common use cases for Cassandra. Time series data in general is often a very good fit for the Cassandra data model. More specifically, many people store metric/sensor data like the data you are describing. See:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://engineering.rockmelt.com/post/17229017779/modeling-time-series-data-on-top-of-cassandra
As for your concerns about the community, I'm not sure what is giving you that impression, but there is quite a large community (see IRC, mailing lists) as well as a growing number of Cassandra users.
http://www.datastax.com/cassandrausers
Regarding your criteria:
Open source
Yes
Works well with Python
http://pycassa.github.com/pycassa/
Appropriate for a small team
Yes
Very well documented
http://www.datastax.com/docs/1.1/index
Has specific features to take advantage of ordered time series data
See above links
Helps me solve some of my versioned data problems
If I understand your description correctly, you could solve this in multiple ways. You could start writing a new row when the version changes. Alternatively, you could use composite columns to store the version along with the timestamp/value pair.
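A toy in-memory model of the "new row per version" idea may help to picture it. This is plain Python, not pycassa or any Cassandra API; the class, method names and integer timestamps are hypothetical, and it assumes each version cutoff is registered before any data past that cutoff is written.

```python
from bisect import bisect_right
from collections import defaultdict


class VersionedSeries:
    """Toy stand-in for the 'new row per version' layout: each
    (metric, version) pair gets its own row of (timestamp, value) pairs."""

    def __init__(self):
        self.cutoffs = defaultdict(list)  # metric -> sorted version start times
        self.rows = defaultdict(list)     # (metric, version) -> [(ts, value)]

    def start_version(self, metric, ts):
        """Record that readings at or after `ts` use a new calculation."""
        self.cutoffs[metric].append(ts)
        self.cutoffs[metric].sort()

    def write(self, metric, ts, value):
        # Version number = count of cutoffs at or before ts (0 if none).
        version = bisect_right(self.cutoffs[metric], ts)
        self.rows[(metric, version)].append((ts, value))

    def latest(self, metric):
        """Return only the readings from the most recent version, in time order."""
        version = len(self.cutoffs[metric])
        return sorted(self.rows[(metric, version)])


series = VersionedSeries()
series.write("cpu", 100, 0.5)
series.write("cpu", 200, 0.6)
series.start_version("cpu", 300)  # calculation changes at ts=300
series.write("cpu", 300, 55.0)
series.write("cpu", 400, 61.0)

latest = series.latest("cpu")
print(latest)  # [(300, 55.0), (400, 61.0)]
```

Old readings are never overwritten; they stay in the rows of earlier versions, which matches the timestamp-cutoff style of versioning described in the question.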
I'll also note that Accumulo, HBase, and Cassandra all have essentially the same data model. You will still find small differences around the data model in regards to specific features that each database offers, but the basics will be the same.
The bigger difference between the three will be the architecture of the system. Cassandra takes its architecture from Amazon's Dynamo. Every server in the cluster is the same and it is quite simple to set up. HBase and Accumulo are more direct clones of BigTable. These have more moving parts and will require more setup and more types of servers: for example, setting up HDFS, Zookeeper, and HBase/Accumulo-specific server types.
Disclaimer: I work for DataStax (we work with Cassandra)
I only have experience in Cassandra and MongoDB but my experience might add something.
So you're basically doing time-based metrics?
OK, if I understand right, you use the timestamp as a versioning mechanism, so that you query per a certain timestamp: say, to get the latest calculation used, you go by the metric ID or whatever, sort by ts DESC and take the first row?
It sounds like a versioned key value store at times.
With this in mind I probably would not recommend either of the two I have used.
Cassandra is too rigid and too hierarchical, too tied to how you query, to the point where you can only make one pivot of graph data (I presume you would want to graph these metrics) from the column family, which is crazy and is why I dropped it. As for searching (which Facebook uses it for, and only that), it's not that impressive either.
MongoDB, well, I love MongoDB and I am an elite member of the user group, and it could work here if you didn't use a key-value storage policy. But at the end of the day, if your mind is not set and you don't like the tech, then let me be the very first to say: don't use it! You will be no good at a tech that you don't like, so stay away from it.
Though I would picture this happening in Mongo much like:
{
    _id: ObjectId(),
    metricId: 'AvailableMessagesInQueue',
    formula: '4+5/10.01',
    result: NaN,
    ts: ISODate()
}
And you query for the latest version of your calculation by:
// Sort newest first and keep only the first document
var results = db.metrics.find({ 'metricId': 'AvailableMessagesInQueue' }).sort({ ts: -1 }).limit(1);
var latest = results.next();
Which would output the doc structure you see above. Without knowing more about exactly how you wish to query, the general server and app scenario, etc., that's the best I can come up with.
I found this thread on HBase though: http://mail-archives.apache.org/mod_mbox/hbase-user/201011.mbox/%3C5A76F6CE309AD049AAF9A039A39242820F0C20E5#sc-mbx04.TheFacebook.com%3E
Which might be of interest, it seems to support the argument that HBase is a good time based key value store.
I have not personally used HBase so do not take anything I say about it seriously....
I hope I have added something, if not you could try narrowing your criteria so we can answer more dedicated questions.
Hope it helps a little,
Not a plug for any particular technology but this article on Time Series storage using MongoDB might provide another way of thinking about the storage of large amounts of "sensor" data.
http://www.10gen.com/presentations/mongodc-2011/time-series-data-storage-mongodb
Axibase Time-Series Database
Open source
There is a free Community Edition
Works well with Python
https://github.com/axibase/atsd-api-python. There are also other language wrappers, for example ATSD R client.
Appropriate for a small team
Built-in graphics and rule engine make it productive for building an in-house reporting, dashboarding, or monitoring solution with less coding.
Very well documented
It's hard to beat IBM Redbooks, but we're trying. API, configuration, and administration are documented in detail and with examples.
Has specific features to take advantage of ordered time series data
It's a time-series database from the ground up, so aggregation, filtering and non-parametric ARIMA and HW forecasts are available.
Helps me solve some of my versioned data problems
ATSD supports versioned time-series data natively in SE and EE editions. Versions keep track of status, change-time and source changes for the same timestamp for audit trails and reconciliations. It's a useful feature to have if you need clean, verified data with tracing. Think energy metering, PHMR records. ATSD schema also supports series tags, which you could use to store versioning columns manually if you're on CE edition or you need to extend default versioning columns: status, source, change-time.
Disclosure - I work for the company that develops ATSD.