Can I use Apache Mahout Taste for User Preferences matching?

I am trying to match objects based on predefined user preferences. A simple example would be finding the best-matching vehicle.
Let's say a user 'Tom' is offered a rented vehicle for travel based on his predefined preferences. In this case, the predefined user preferences will be -
** Pre-defined user preferences for Tom:
PreferredVehicle (Make='ANY', Type='3-wheeler/4-wheeler',
Category='Sedan/Hatchback', AC/Non-AC='AC')
** while the 10 available vehicles are -
Vehicle1(Make='Toyota', Type='4-wheeler', Category='Hatchback', AC/Non-AC='AC')
Vehicle2(Make='Tata', Type='3-wheeler', Category='Transport', AC/Non-AC='Non-AC')
Vehicle3(Make='Honda', Type='4-wheeler', Category='Sedan', AC/Non-AC='AC')
;
;
and so on up to 'Vehicle10'
All I want to do is choose a vehicle for Tom that best matches his preferences, and probably also give him the choices in order, i.e. best match first.
Questions I have:
Can this be done with Mahout Taste?
If yes, can someone please point me to some example code where I can start quickly?

A recommender may not be the best tool for the job here, for a few reasons. First, I don't expect that the best answers are all that personal in this domain. If I wanted a Ford Focus, the best alternative you have is likely about the same for most every user. Second, there is not much of a discovery problem here. I'm searching for a vehicle that meets certain needs; I don't particularly want or need to find new and unknown vehicles, like I would for music. Finally you don't have much data per user; I assume most users have never rented before, and very few have even 3+ rentals.
Can you throw this data at a recommender anyway? Sure, try Mahout Taste (I'm the author). If you have the book Mahout in Action it will walk you through it. Since it's non-rating data, I can also recommend the successor project, Myrrix (http://myrrix.com) as it will be easier to set up and run. You can at least evaluate the results to see if it's anywhere near useful.
Either way, your work will just be to make a CSV file of "userID,vehicleID" pairs from your data and feed it in. Then it will give you vehicle IDs as recommendations for any user ID.
But I imagine you will do much better to analyze what people picked when the car they wanted wasn't available, look at the difference, learn which attributes are most and least likely to be sacrificed, and score the alternatives that way. This is entirely feasible since this data set is small and you have rich item attribute data.
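As a rough illustration of scoring alternatives by attribute, here is a minimal sketch. It is not Mahout code. The weights are hypothetical placeholders for what you would learn from past substitutions, and the attribute names (make, type, category, ac) are just my shorthand for the fields in the question.

```python
# Hypothetical weights: how much it matters that each attribute is satisfied.
# In practice these would be learned from what users accepted as substitutes.
WEIGHTS = {"make": 1.0, "type": 2.0, "category": 2.0, "ac": 3.0}


def score(vehicle, prefs, weights=WEIGHTS):
    """Sum the weights of the attributes on which the vehicle satisfies the preference."""
    total = 0.0
    for attr, accepted in prefs.items():
        # "ANY" means the user does not care about this attribute.
        if accepted == "ANY" or vehicle[attr] in accepted:
            total += weights[attr]
    return total


def rank(vehicles, prefs):
    """Return vehicle names ordered from best match to worst."""
    return sorted(vehicles, key=lambda name: score(vehicles[name], prefs), reverse=True)


tom = {
    "make": "ANY",
    "type": {"3-wheeler", "4-wheeler"},
    "category": {"Sedan", "Hatchback"},
    "ac": {"AC"},
}
vehicles = {
    "Vehicle1": {"make": "Toyota", "type": "4-wheeler", "category": "Hatchback", "ac": "AC"},
    "Vehicle2": {"make": "Tata", "type": "3-wheeler", "category": "Transport", "ac": "Non-AC"},
    "Vehicle3": {"make": "Honda", "type": "4-wheeler", "category": "Sedan", "ac": "AC"},
}
ranking = rank(vehicles, tom)
```

With these made-up weights, Vehicle1 and Vehicle3 satisfy every preference and tie at the top, while Vehicle2 comes last because it misses on category and AC.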


How to implement Associative Rules Analysis or Market Basket Analysis from scratch?

I went through numerous articles trying to understand what my first step should be to incorporate associative analysis (maybe market basket analysis) into my system. They all go deep into the implementation of the algorithm, but none of them talks about how to store the data in the first place.
I would really appreciate it if someone could give me some starting pointers or article links that I can begin with.
The first thing I want to implement is to track user clicks and provide suggestions based on tracked data.
E.g. User clicked on link A and subsequently on link B and link C. I can track this activity with some metadata associated (user, user organization, user role etc.)
I do not want it to be limited only to links. In future, I want to add a number of similar use cases to the system and make it smart. E.g. if a user sets specific values for fields A and B, most likely he/she will set value <bla> for field C.
My system may generate several thousand such data points in a day (E.g. user clicks, field selection etc.).
Below are my questions:
How should I store my data? Go SQL or NoSQL? (I briefly looked into MongoDB and it looked promising.)
What tool should I use to perform the associative analysis? Are there any open source tools I can use?
It depends. Is your data suitable for NoSQL databases? To answer this question, it's better to read about the CAP theorem and its case studies: https://en.wikipedia.org/wiki/CAP_theorem or http://robertgreiner.com/2014/06/cap-theorem-explained/
Sometimes you want consistency (depending on your data) and availability, in which case it's better to use a relational database like MySQL. (Try to read case studies and analyse your data to pick the best tool.)
There is a large number of open source libraries, but in my opinion it's better to first read up on some concepts and algorithms. Try searching for the Apriori, ECLAT and FP-Growth algorithms and get the concepts behind them. Then you can pick a tool or write the code yourself. Some useful tools (depending on your programming language):
Python: https://github.com/asaini/Apriori, https://github.com/enaeseth/python-fp-growth, https://github.com/enaeseth/python-fp-growth/blob/master/fp_growth.py
PHP: https://github.com/sigidhanafi/fp-growth-php
JAVA: https://github.com/goodinges/FP-Growth-Java, http://www.philippe-fournier-viger.com/spmf/
Also you can use Spark: https://spark.apache.org/docs/1.1.1/mllib-guide.html
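To get a feel for the concepts before picking a library, here is a minimal from-scratch sketch. It only counts pairs of items, so it is a simplification of Apriori rather than the full algorithm. It assumes the tracked clicks have already been grouped into one list of items per session, however they are stored; the function name and the sample data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5):
    """Return {(a, b): confidence of 'a -> b'} for item pairs with enough support."""
    n = len(sessions)
    item_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        items = sorted(set(session))  # ignore repeated clicks within a session
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        # Support: fraction of sessions that contain both items.
        if count / n >= min_support:
            # Confidence: of the sessions with the left item, how many have the right one.
            rules[(a, b)] = count / item_counts[a]
            rules[(b, a)] = count / item_counts[b]
    return rules


sessions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"]]
rules = pair_rules(sessions, min_support=0.5)
```

Here every pair appears in 2 of the 4 sessions, so all of them pass the 0.5 support threshold, and each rule has a confidence of 2/3. The libraries above do the same kind of counting for larger itemsets and do it far more efficiently.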

word suggestion based on input algorithm?

I am thinking of creating a web site that lets people rate restaurants. Since I don't have a database containing all the restaurants, this web site relies on users' input.
But there is a problem with this method: people may use different words (names) to describe the same restaurant, and I don't want to create different entries in the database when they refer to the same restaurant.
For example, when describing KFC, some people use the name "KFC", while others may use "Kentucky Fried Chicken".
How can I make the system automatically detect this and give the user a list of existing items in the database?
This should be quite similar to Stack Overflow, which shows you "questions with similar titles". But I don't know how to implement this.
You can't ... you have to create a list of the restaurant names and their "synonyms" and other possible spellings.
How can I make the system to automatically detect this?
The system doesn't know that "KFC" means "Kentucky Fried Chicken".
Make a map of synonyms, to let it know.
This should quite similar to stackoverflow, which tells you "questions with similar title"
It generally matches word-for-word. It may have an internal list of synonyms to deal with common cases.
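A minimal sketch of the synonym-map idea, with a fuzzy fallback for misspellings using Python's standard difflib module. The map contents, the function name and the 0.6 cutoff are assumptions for illustration; the map itself still has to be built by hand, as said above.

```python
import difflib

# Hand-maintained map from known spellings (lower-cased) to the canonical name.
SYNONYMS = {
    "kfc": "Kentucky Fried Chicken",
    "kentucky fried chicken": "Kentucky Fried Chicken",
}


def suggest(name, known=SYNONYMS, limit=3):
    """Return canonical restaurant names that the typed name probably refers to."""
    key = name.strip().lower()
    if key in known:
        return [known[key]]
    # Fall back to approximate string matching to catch typos.
    close = difflib.get_close_matches(key, list(known), n=limit, cutoff=0.6)
    suggestions = []
    for match in close:
        canonical = known[match]
        if canonical not in suggestions:
            suggestions.append(canonical)
    return suggestions
```

An exact synonym such as "KFC" resolves through the map, a typo such as "Kentucky Fried Chiken" is caught by the fuzzy match, and an unknown name returns an empty list, at which point you would let the user create a new entry.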

How do I adapt my recommendation engine to cold starts?

I am curious what methods / approaches there are to overcome the "cold start" problem: when a new user or item enters the system, the lack of information about this new entity makes recommendation difficult.
I can think of doing some prediction-based recommendation (based on gender, nationality and so on).
You can cold start a recommendation system.
There are two types of recommendation systems: collaborative filtering and content-based. Content-based systems use metadata about the things you are recommending. The question is then which metadata is important. The second approach is collaborative filtering, which doesn't care about the metadata; it just uses what people did or said about an item to make a recommendation. With collaborative filtering you don't have to worry about which terms in the metadata are important. In fact, you don't need any metadata to make the recommendation. The problem with collaborative filtering is that you need data. Before you have enough data, you can use content-based recommendations. You can provide recommendations that are based on both methods: at the beginning use 100% content-based, then as you get more data start to mix in collaborative filtering.
That is the method I have used in the past.
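A minimal sketch of that mixing step, assuming you already have a content-based score and a collaborative score for the same item on the same scale. The linear ramp and the threshold of 50 interactions are arbitrary assumptions, not something prescribed by either method.

```python
def blend_weight(n_interactions, full_trust_at=50):
    """Weight given to collaborative filtering: 0.0 with no data, 1.0 once there is enough."""
    return min(1.0, n_interactions / full_trust_at)


def hybrid_score(content_score, collab_score, n_interactions, full_trust_at=50):
    """Mix the two scores, shifting from content-based to collaborative as data accumulates."""
    w = blend_weight(n_interactions, full_trust_at)
    return (1 - w) * content_score + w * collab_score
```

For a brand-new user or item (zero interactions) the result is purely the content-based score; halfway to the threshold the two scores are averaged; at or beyond the threshold only the collaborative score counts.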
Another common technique is to treat the content-based portion as a simple search problem. You just put in meta data as the text or body of your document then index your documents. You can do this with Lucene & Solr without writing any code.
If you want to know how basic collaborative filtering works, check out Chapter 2 of "Programming Collective Intelligence" by Toby Segaran
Maybe there are times you just shouldn't make a recommendation? "Insufficient data" should qualify as one of those times.
I just don't see how prediction recommendations based on "gender, nationality and so on" will amount to more than stereotyping.
IIRC, places such as Amazon built up their databases for a while before rolling out recommendations. It's not the kind of thing you want to get wrong; there are lots of stories out there about inappropriate recommendations based on insufficient data.
I'm working on this problem myself, but this paper from Microsoft on Boltzmann machines looks worthwhile: http://research.microsoft.com/pubs/81783/gunawardana09__unified_approac_build_hybrid_recom_system.pdf
This has been asked several times before (naturally, I cannot find those questions now :/), but the general conclusion was that it's better to avoid such recommendations. In various parts of the world, the same names belong to different sexes, and so on ...
Recommendations based on "similar users liked..." clearly must wait. You can give out coupons or other incentives to survey respondents if you are absolutely committed to doing predictions based on user similarity.
There are two other ways to cold-start a recommendation engine.
Build a model yourself.
Get your suppliers to fill in key information to a skeleton model. (Also may require $ incentives.)
Lots of potential pitfalls in all of these, which are too common sense to mention.
As you might expect, there is no free lunch here. But think about it this way: recommendation engines are not a business plan. They merely enhance the business plan.
There are three things needed to address the Cold-Start Problem:
The data must have been profiled such that you have many different features (with product data the term used for 'feature' is often 'classification facets'). If you don't properly profile data as it comes in the door, your recommendation engine will stay 'cold' as it has nothing with which to classify recommendations.
MOST IMPORTANT: You need a user-feedback loop with which users can review the personalization engine's suggestions. For example, a Yes/No button for 'Was This Suggestion Helpful?' should queue a review that moves participants from one training dataset (i.e. the 'Recommend' training dataset) to another training dataset (i.e. the 'DO NOT Recommend' training dataset).
The model used for (Recommend/DO NOT Recommend) suggestions should never be considered to be a one-size-fits-all recommendation. In addition to classifying the product or service to suggest to a customer, how the firm classifies each specific customer matters too. If functioning properly, one should expect that customers with different features will get different suggestions for (Recommend/DO NOT Recommend) in a given situation. That would be the 'personalization' part of personalization engines.

Essential techniques for pinpointing missing requirements?

An initial draft of the requirements specification has been completed, and now it is time to take stock of the requirements and review the specification. Part of this process is to make sure that there are no sizeable gaps in the specification. Needless to say, gaps lead to highly inaccurate estimates, inevitable scope creep later in the project and ultimately to a death march.
What are the good, efficient techniques for pinpointing missing and implicit requirements?
This question is about practical techniques, not general advice, principles or guidelines.
Missing requirements are anything crucial for completeness of the product or service but not thought of or forgotten about.
Implicit requirements are something that users or customers naturally assume is going to be a standard part of the software without having to be explicitly asked for.
I am happy to revisit the accepted answer, as long as someone submits a better, more comprehensive solution.
Continued, frequent, frank, and two-way communication with the customer strikes me as the main 'technique' as far as I'm concerned.
It depends.
It depends on whether you're being paid to deliver what you said you'd deliver or to deliver high quality software to the client.
If the former, simply eliminate ambiguity from the specifications and then build what you agreed to. Try to stay away from anything not measurable (like "fast", "cool", "snappy", etc...).
If the latter, what Galwegian said + time or simply cut everything not absolutely drop-dead critical and build that as quickly as you can. Production has a remarkable way of illuminating what you missed in Analysis.
Evaluate the lifecycle of the elements of the model with respect to a generic/overall model such as
acquisition --> stewardship --> disposal
Do you know where every entity comes from and how you're going to get it into your system?
Do you know where every entity, once acquired, will reside, and for how long?
Do you know what to do with each entity when it is no longer needed?
For a more fine-grained analysis of the lifecycle of the entities in the spec, make a CRUDE matrix for the major entities in the requirements; this is a matrix with the operations/applications as the rows and the entities as the columns. In each cell, put a C if the application Creates the entity, R for Reads, U for Updates, D for Deletes, or E for "Edits"; 'E' encompasses C, R, U, and D (most 'master table maintenance' apps will be Es). Then check each column for C, R, U, and D (or E); if one is missing (except E), figure out if it is needed. The rows and columns of the matrix can be rearranged (manually or using affinity analysis) to form cohesive groups of entities and applications which generally correspond to subsystems; this may assist with physical system distribution later.
It is also useful to add a "User" entity column to the CRUDE matrix and specify for each application (or feature or functional area or whatever you want to call the processing/behavioral aspects of the requirements) whether it takes Input from the user, produces Output for the user, or Interacts with the user (I use I, O, and N for this, and always make the User the first column). This helps identify where user-interfaces for data-entry and reports will be required.
The goal is to check the completeness of the specification; the techniques above are useful for checking whether the life-cycle of the entities is 'closed' with respect to the entities and applications identified.
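The column check on the CRUDE matrix can be automated once the matrix is written down. Here is a minimal sketch, assuming the matrix is kept as a dictionary of applications to entities to letters; the application and entity names are invented for the example.

```python
def missing_operations(matrix):
    """For each entity, list the C/R/U/D operations that no application performs.

    matrix maps application -> {entity: letters}, e.g. {"Invoicing": {"Order": "RU"}}.
    An 'E' (Edits) counts as all of C, R, U and D.
    """
    seen = {}
    for row in matrix.values():
        for entity, letters in row.items():
            ops = seen.setdefault(entity, set())
            for letter in letters:
                if letter == "E":
                    ops.update("CRUD")
                else:
                    ops.add(letter)

    gaps = {}
    for entity, ops in seen.items():
        missing = sorted(set("CRUD") - ops)
        if missing:
            gaps[entity] = missing
    return gaps


matrix = {
    "Order entry": {"Order": "CR", "Customer": "R"},
    "Customer maintenance": {"Customer": "E"},
    "Invoicing": {"Order": "RU", "Invoice": "C"},
}
gaps = missing_operations(matrix)
```

In this example nothing ever deletes an Order, and an Invoice is created but never read, updated or deleted. Each gap is a question to take back to the customer: is the operation really not needed, or is a requirement missing?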
Here's how you find the missing requirements.
Break the requirements down into tiny little increments. Really small. Something that can be built in two weeks or less. You'll find a lot of gaps.
Prioritize those into what would be best to have first, what's next down to what doesn't really matter very much. You'll find that some of the gap-fillers didn't matter. You'll also find that some of the original "requirements" are merely desirable.
Debate the differences of opinion as to what's most important to the end users and why. Two users will have three opinions. You'll find that some users have no clue, and none of their "requirements" are required. You'll find that some people have no spine, and things they aren't brave enough to say out loud are "required".
Get a consensus on the top two or three only. Don't argue out every nuance. It isn't possible for anyone to envision what software will be like and how they will use it. Most people's "requirements" are descriptions of how they struggle to work around the inadequate business processes they're stuck with today.
Build the highest-priority, most important part first. Give it to users.
GOTO 1 and repeat the process.
"Wait," you say, "What about the overall budget?" What about it? You can never know the overall budget. Do the following.
Look at each increment defined in step 1. Provide a price-per-increment. In priority order. That way someone can pick as much or as little as they want. There's no large, scary "Big Budgetary Estimate With A Lot Of Zeroes". It's all negotiable.
I have been using a modeling methodology called Behavior Engineering (bE) that uses the original specification text to create the resulting model. Once you have the model, it is easier to identify missing or incomplete sections of the requirements.
I have used the methodology on about six projects so far, ranging from fewer than a hundred requirements to over 1300 requirements. If you want to know more, I would suggest going to www.behaviorengineering.org; there are some really good papers there regarding the methodology.
The company I work for has created a tool to perform the modeling. The work rate to actually create the model is about 5 requirements an hour for a novice and about 13 requirements an hour for an expert. The cool thing about the methodology is that you don't really need to know anything about the domain the specification is written for. Using just the user text, such as nouns and verbs, the modeller will find gaps in the model in a very short period of time.
I hope this helps
Michael Larsen
How about building a prototype?
While reading tons of literature about software requirements, I found these two interesting books:
Problem Frames: Analysing & Structuring Software Development Problems by Michael Jackson (not a singer! :-).
Practical Software Requirements: A Manual of Content and Style by Benjamin L. Kovitz.
These two authors really stand out from the crowd because, in my humble opinion, they are making a really good attempt to turn development of requirements into a very systematic process - more like engineering than art or black magic. In particular, Michael Jackson's definition of what requirements really are - I think it is the cleanest and most precise that I've ever seen.
I wouldn't do these authors a good service by trying to describe their approach in a short posting here, so I am not going to do that. But I will try to explain why their approach seems to be extremely relevant to your question: it allows you to boil down most (not all, but most!) of your requirements development work to processing a bunch of check-lists telling you what requirements you have to define to cover all important aspects of the entire customer's problem. In other words, this approach is supposed to minimize the risk of missing important requirements (including those that often remain implicit).
I know it may sound like magic, but it isn't. It still takes a substantial mental effort to come to those "magic" check-lists: you have to articulate the customer's problem first, then analyze it thoroughly, and finally dissect it into so-called "problem frames" (which come with those magic check-lists only when they closely match a few typical problem frames defined by authors). Like I said, this approach does not promise to make everything simple. But it definitely promises to make requirements development process as systematic as possible.
If requirements development in your current project is already quite far from the very beginning, it may not be feasible to try to apply the Problem Frames Approach at this point (although it greatly depends on how your current requirements are organized). Still, I highly recommend reading those two books - they contain a lot of wisdom that you may still be able to apply to the current project.
My last important notes about these books:
As far as I understand, Mr. Jackson is the original author of the idea of "problem frames". His book is quite academic and theoretical, but it is very, very readable and even entertaining.
Mr. Kovitz's book tries to demonstrate how Mr. Jackson's ideas can be applied in real practice. It also contains tons of useful information on writing and organizing the actual requirements and requirements documents.
You can probably start with Mr. Kovitz's book (and refer to Mr. Jackson's book only if you really need to dig deeper on the theoretical side). But I am sure that, at the end of the day, you should read both books, and you won't regret that. :-)
HTH...
I agree with Galwegian. The technique described is far more efficient than the "wait for customer to yell at us" approach.

How to address semantic issues with tag-based web sites

Tag-based web sites often suffer from the delicacy of language such as synonyms, homonyms, etc. For programmers looking for information, say on Stack Overflow, concrete examples are:
Subversion or SVN (or svn, with case-sensitive tags)
.NET or Mono
[Will add more]
The problem is that we do want to preserve our delicacy of language and make the machine deal with it as well as possible.
A site like del.icio.us sees its tag base grow a lot, thus probably hindering usage or search. Searching for SVN-related entries will probably list a majority of entries with both subversion and svn tags, but I can think of three issues:
A search is incomplete as many entries may not have both tags (which are 'synonyms').
A search is less useful as Q/A often lead to more Qs! Notably for newbies on a given topic.
Tagging a question (note: or an answer separately, sounds useful) becomes philosophical: 'Did I Tag the Right Way?'
One way to address these issues is to create semantic links between tags, so that subversion and SVN are automatically bound by the system, not by poor users.
Is it an approach that sounds good/feasible/attractive/useful? How to implement it efficiently?
Recognizing synonyms and semantic connections is something that humans are good at; a solution to organizing an open-ended taxonomy like what SO is featuring would probably be well served by finding a way to leave the matching to humans.
One general approach: someone (or some team) reviews new tags on a daily basis. New synonyms are added to synonym groups. Searches hit synonym groups (or, more nuanced, hit either literal matches or synonym group matches according to user preference).
This requires support for synonym groups on the back end (work for the dev team). It requires a tag wrangler or ten (work for the principals or for trusted users). It doesn't require constant scaling, though: the rate at which the total tag pool grows will in all likelihood decrease over time (after the initial Here Comes Everybody bump of the open beta), as any organic lexicon's growth rate does.
Synonymy strikes me as the go-to issue. Hierarchical mapping is an ambitious and more complicated issue; it may be worth it or it may not be, but given the relative complexity of defining the hierarchy it'd probably be better left as a Phase 2 to any potential synonym project's Phase 1.
The way the software on blogspot.com is set up, is that there is an ajax-autocomplete-thingie on the box where you write the name of the tags. This searches all your previous posts for tags that start with the same letters. At least that way you catch different casings and spellings (but not synonyms).
How would the system know which tags to semantically link? Would it keep an ever-growing map of tags? I can't see that working. What if someone typed sbversion instead? How would that get linked?
I think that asking the user when they submit tags could work. For example: "You've entered the following tags: sbversion, pascal and bindings. Did you mean 'Subversion', 'Pascal' and 'Bindings'?"
Obviously the system would have to have a fairly smart matching system for that to work. Doing it this way would be extra input for the user (which would probably annoy them), but the human input would, if done correctly, make for fewer duplicate tags.
In fact, having said all that, the system could use the results of the user's input as a basis for automatic tag matching. From the previous example, someone creates a tag of "sbversion" and when prompted changes it to "Subversion" - the system could learn that and do it automatically next time.
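A minimal sketch of that learning step, assuming the site records what the user typed and what they picked when prompted. The class name and the threshold of three confirmations are made up; the threshold just keeps a single user's mistake from becoming an automatic rule.

```python
from collections import Counter, defaultdict


class TagCorrector:
    """Remembers which correction users accept for a typed tag and applies it once it is common."""

    def __init__(self, min_confirmations=3):
        self.min_confirmations = min_confirmations
        self.corrections = defaultdict(Counter)  # typed tag -> Counter of chosen tags

    def record(self, typed, chosen):
        """Store one user decision made at the 'Did you mean ...?' prompt."""
        if typed.lower() != chosen.lower():
            self.corrections[typed.lower()][chosen] += 1

    def resolve(self, typed):
        """Return the learned correction if enough users confirmed it, else the tag as typed."""
        counts = self.corrections.get(typed.lower())
        if counts:
            tag, confirmations = counts.most_common(1)[0]
            if confirmations >= self.min_confirmations:
                return tag
        return typed
```

Until "sbversion" has been corrected to "Subversion" often enough, resolve("sbversion") returns the tag unchanged and the user would still be prompted; after that it is replaced automatically.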
Part of the issue you're looking at is that English is rife with synonyms - are the following different: build-management, subversion, cvs, source-control?
Maybe, maybe not. Having a system, like the one [now] in use on SO that brings up the tag you probably meant is extremely helpful. But it doesn't stop people from bulling-through the tagging process.
Maybe you could refuse to accept "new" tags without a user-interaction? Before you let 'sbversion' go in, force a spelling check?
This is definitely an interesting problem. I asked an open question similar to this on my blog last year. A couple of the responses were quite insightful.
I completely agree about the mass of tags that we currently have. I don't participate in other tag-based sites. However, having a hierarchy of tags would be very helpful, instead of ruby, rails, ruby-on-rails, rubyonrails, etc.
Tags are basically our admission that search algorithms aren't up to snuff. If we can get a computer to be smart enough to identify that things tagged "Subversion" have similar content to things tagged "svn", presumably we can parse the contents, so why not skip tags altogether, and match a search term directly to the content (i.e., autotagging, which is basically mapping keywords to results)?!
The problem is to make the search engine use the fact that 'subversion' and 'svn' are very similar to the point that they mean the same 'thing'.
It might be attractive to compute a simple similarity between tags based on frequency: 'subversion' and 'svn' appear very often together, so requesting 'svn' would return SVN-related questions, but also the rare questions only tagged 'subversion' (and vice versa). However, 'java' and 'c#' also appear often together, but for very different reasons (they are not synonyms). So similarity based on frequency is out.
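To make the objection concrete, here is a small sketch that measures co-occurrence with the Jaccard index (questions carrying both tags divided by questions carrying either). The tagged questions are invented toy data, and Jaccard is just one possible choice of frequency-based similarity.

```python
def jaccard(tag_a, tag_b, questions):
    """Co-occurrence similarity: questions with both tags / questions with either tag."""
    with_a = {i for i, tags in enumerate(questions) if tag_a in tags}
    with_b = {i for i, tags in enumerate(questions) if tag_b in tags}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


# Toy data: each set holds the tags of one question.
questions = [
    {"subversion", "svn"},
    {"subversion", "svn", "merge"},
    {"svn"},
    {"java", "c#"},
    {"java", "c#", "generics"},
    {"java"},
]
synonym_score = jaccard("subversion", "svn", questions)
comparison_score = jaccard("java", "c#", questions)
```

On this data both pairs score exactly 2/3, so the number alone cannot tell a pair of synonyms from two distinct languages that are often compared.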
An answer to this problem might be a mix of mechanisms, such as the ones suggested in this Q/A thread:
Filtering out typos by suggesting tags when the user inputs them.
Maintaining a user-generated map of synonyms. This map may not be that big if it just targets synonyms.
Allowing multi-tag search, such that the user can put 'subversion svn' or 'subversion && svn' (well, from programmers to programmers) in the search box and get both. This would be quite practical as many users may actually try such approach when they do not know which term is the most meaningful.
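The synonym map and the multi-tag search can be combined by expanding each search term to its synonym group before matching. A minimal sketch, assuming a small user-maintained list of synonym groups and "any tag matches" semantics; the group contents and question titles are invented.

```python
# User-maintained synonym groups (hypothetical contents).
SYNONYM_GROUPS = [
    {"subversion", "svn"},
    {"ruby-on-rails", "rails", "rubyonrails"},
]


def expand(tag, groups=SYNONYM_GROUPS):
    """Return the tag together with all of its known synonyms."""
    tag = tag.lower()
    for group in groups:
        if tag in group:
            return set(group)
    return {tag}


def search(query, questions, groups=SYNONYM_GROUPS):
    """Return titles of questions carrying any queried tag or a synonym of one."""
    wanted = set()
    for tag in query.split():
        wanted |= expand(tag, groups)
    return [title for title, tags in questions if wanted & tags]


questions = [
    ("How do I merge a branch?", {"svn", "merge"}),
    ("Ignoring files", {"subversion"}),
    ("Generics question", {"java"}),
]
hits = search("svn", questions)
```

Searching for 'svn' now also returns the question tagged only 'subversion', without anyone having to type both terms.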
#Nick: Agreed. The question is not meant to argue against tags. Tags have great potential, but users will face a growing issue if one cannot search 'across' tags.
#Steve: Maintaining an ever-growing map of tags is definitely not practical. As SO is accumulating an ever-growing bag of tags, how could we shed some light on this bag to make search of Q/A tags even more useful, in a convenient way?
#Espo: 'Ajax-powered' tag suggestions based on existing tags is apparently available on SO when creating a question. This is by the way very helpful to choose tags and appropriate spelling (avoiding the 'subversion' vs. 'sbversion' issue from Steve).