how to add user and item metadata in recommendation engine and which python open source tool can give this features - recommendation-engine

Ex : i have a master file like
userid itemid rating
1 2 5
another user file, where user related metadata present, metadata could be many :
userid age
1 5
2 8
also i have a item file, where item related metadata present. meta data could be many
itemid item_catagory item_geo
1 5 india
if i will recommend any item to a user, i want to include these meta information.
i want to know whether matrix factorisation will be helpful and which python open source module has this kind of implementation available.

To include additional information to recommender system e.g. about an item, user or a recommendation event one of the existing methods for Context Aware Recommender System (CARS) can be applied.
Try starting with Factorization Machines. You can convert Your data to svmlight format and go with reference implementation libfm. It has the best documentation and there is a similar example where you can find how to prepare Your data. If it has to be Python, there are also many existing implementation, e.g. fastFM Simpler - pyFM. If You want to try Julia, maybe You can use my code of FactorizationMachines.jl as a base :-)
If You will be comfortable with standard FM, check out: LightFM and Field Aware FM
I think FM are one of the most successful and easy to use models for CARS. Please give some feedback how it works for You. If You need different approach I will be happy to help.

Related

TYPO3 backend workflow when avoiding the storage of data in intermediate table

I have a situation as described in the ExtbaseFluid book:
I would like to store information in the intermediate table which is not recommended at all.
Here is a cite from the warning box of the above linked book chapter:
Do not store data in the Intermediate Table that concern the Domain. Though TYPO3 supports this (especially in combination with Inline Relational Record Editing (IRRE) but this is always a sign that further improvements can be made to your Domain Model. Intermediate Tables are and should always be tools for storing relationships and nothing else.
Let’s say you want to store a CD with its containing music tracks: CD -- m:n (Intermediate Table) -- Song. The track number may be stored in a field of the Intermediate Table. However, the track should be stored as a separate domain object, and the connection be realized as CD -- 1:n -- Track -- n:1 -- Song.
So I want not to do what is not recommended. But thinking about the workflow for the editor that results of the recommended solution rises a few question for me.
To stay with this example I would need the following tables:
tx_extname_domain_model_cd
tx_extname_domain_model_cd_track_mm
tx_extname_domain_model_track (which holds the track number)
tx_extname_domain_model_track_song_mm
tx_extname_domain_model_song
From what I know this would end in the situation that the editor would need to create following records:
one record for the cd
one record for the song
now the editor can create one record for the track.
There the track number is added.
Furthermore the cd record needs to be assigned as well as the song.
So here are my questions:
I guess this workflow cannot be improved with some (to me unknown) TCA setup?
An editor cannot directly reach the song when the cd record is opened?
Instead first she / he has to open the track record and can from there navigate to the song?
Is it really that bad to store data in the intermediate table? The TYPO3 table sys_file_reference does the same!? But I wonder how those data could be shown (because IRRE is not possible because it shall only be used for 1:n relations (source).
The question you have to ask yourself is: Do I want to do coding by the book, or do I want to create a pragmatic approach to solve a customer's problem?
In this specific case the additional problem is, that the people who originally invented Extbase had a quite sophisticated and academic approach, but when it comes to a pragmatic use and performance, they were blocked by their own rules and stuck with coding by the book.
Especially this example and the warning message shows a way of thinking that was one of the reasons, why I never actually used Extbase but went for Core-API methods to create performant and pragmatic queries to get the desired result sets. Now that we've got Doctrine under the hood, this works like a charm even with quite exotic DB flavors.
Of course intermediate tables are a good idea and of course those intermediate tables can and should be enriched with additional data fields, that do not require a 3rd, 4th or nth table to store i.e. a simple set of dropdown options, since this can easily be handled with attributes configured in TCA, as it is shown here: https://docs.typo3.org/m/typo3/reference-tca/master/en-us/ColumnsConfig/Type/Inline/Examples.html
sys_file_reference is the most prominent example since it provides exactly that kind of additional information that should not be pumped into additional tables - and guess what, the TYPO3 core does not make use of a single line of Extbase code to deal with that data or almost any other data of the core tables.
To answer your last question: Take a look at the good old IRRE Tutorial to get a clue how to do m:n connections with intermediate inline tables.
https://docs.typo3.org/typo3cms/extensions/irre_tutorial/0.4.0/Manual/Index.html#intermediate-tables-for-m-n-relations
Depends on the issue, sometimes the intermediate table is an entity, sometimes not. In this example the intermediate table is the track, which would contain: [uid, cd, song, track_no, ... (whatever else needed to define the track)]
Be carefull when you define your data, that you do not make it too advanced.

How to implement Associative Rules Analysis or Market Basket Analysis from scratch?

I tried to went through numerous articles trying to understand what should be my first step to incorporate associative analysis (may be Market Basket analysis) into my system. They all go deep into implementation of algorithm but no one talked about how to store data in the first place.
I will really appreciate if someone can give me some start pointers or article links that I can begin with.
The first thing I want to implement is to track user clicks and provide suggestions based on tracked data.
E.g. User clicked on link A and subsequently on link B and link C. I can track this activity with some metadata associated (user, user organization, user role etc.)
I do not want it to be limited only to links. In future, I want to add number of similar usecases into the system and want to make it smart. E.g. If user set specific values for fields A and B, most likely he/she will set value <bla> for field C.
My system may generate several thousand such data points in a day (E.g. user clicks, field selection etc.).
Below are my questions:
How should I store my data? Go SQL or No SQL (I briefly looked into Mongo DB and it looked promising)
What tool should I use to perform the associative analysis? Are there any open source tools I can use?
It depend. Does your data suitable for NoSql databases? To answer this question it's better to read CAP Theorem and it's case studies: https://en.wikipedia.org/wiki/CAP_theorem or http://robertgreiner.com/2014/06/cap-theorem-explained/
. Some time you want Consistency(depending to your data) and Availability => so that it's better to use Relational Databases like Mysql(Try to read case studies and analyse your data to pick the best tools)
There is large number of open source libraries, but in my opinion it's better to first read some concepts and algorithms. Try searching for Apriori,ECLAT, FP-GROWTH Algorithms and get concepts of them. then you can pick a tool or write the code your self. Some usefull tools(depending to your programming language):
Python: https://github.com/asaini/Apriori, https://github.com/enaeseth/python-fp-growth, https://github.com/enaeseth/python-fp-growth/blob/master/fp_growth.py
PHP: https://github.com/sigidhanafi/fp-growth-php
JAVA: https://github.com/goodinges/FP-Growth-Java, http://www.philippe-fournier-viger.com/spmf/
Also you can use Spark: https://spark.apache.org/docs/1.1.1/mllib-guide.html

data mining project Dilemma

I research a set of data, consisting of two data files:
The first contains user id id artists and ranking of users for artists that want to rank.
The second data file contains id and name artists
I have chosen research question which is:
Is the artist is Popular or not?
In other words,by given the new singer, who will not found in the data file, using algorithms, we will classify it as an artist and to know if it is a popular or not.
For Prediction step I chose to use logistic regression method
But my problem is earlier.
I do not know how, technically, to determine who from the existing data will be defined as successful as an artist who is unsuccessful.
I thought of some methods, for example:k-means with k=2 (but in this method i have a problem with function disance),knn with k=2 etc.
I need guidance ,refers to how i will make to clustering to the Existing data
and general tips to the project.
thank you.

Can I use Apache Mahout Taste for User Preferences matching?

I am trying to match objects based on predefined user preferences. A simple example would be finding best matching vechicle.
Lets say a user 'Tom' is offered a rented vehicle for travel based on his predefined preferences. In this case, the predefined user preferences will be -
** Pre-defined user preferences for Tom:
PreferredVehicle (Make='ANY', Type='3-wheeler/4-wheeler',
Category='Sedan/Hatchback', AC/Non-AC='AC')
** while the 10 available vehicles are -
Vechile1(Make='Toyota', Type='4-wheeler', Category='Hatchback', AC/Non-AC='AC')
Vechile2(Make='Tata', Type='3-wheeler', Category='Transport', AC/Non-AC='Non-AC')
Vechile3(Make='Honda', Type='4-wheeler', Category='Sedan', AC/Non-AC='AC')
;
;
and so on upto 'Vehicle10'
All I want to do is - choose a vehicle for Tom that best matches his preferences and also probably give him choices in order, i.e. best match first.
Questions I have :
Can this be done with Mahout Taste?
If yes, can someone please point me to some example code where I can start quickly?
A recommender may not be the best tool for the job here, for a few reasons. First, I don't expect that the best answers are all that personal in this domain. If I wanted a Ford Focus, the best alternative you have is likely about the same for most every user. Second, there is not much of a discovery problem here. I'm searching for a vehicle that meets certain needs; I don't particularly want or need to find new and unknown vehicles, like I would for music. Finally you don't have much data per user; I assume most users have never rented before, and very few have even 3+ rentals.
Can you throw this data at a recommender anyway? Sure, try Mahout Taste (I'm the author). If you have the book Mahout in Action it will walk you through it. Since it's non-rating data, I can also recommend the successor project, Myrrix (http://myrrix.com) as it will be easier to set up and run. You can at least evaluate the results to see if it's anywhere near useful.
Either way, your work will just be to make a CSV file of "userID,vehicleID" pairs from your data and feed it in. Then it will give you vehicle IDs as recommendations for any user ID.
But, I imagine you will do much better to analyze what people picked when the car wasn't available, and look at the difference, and learn which attributes they are most and least likely to be sacrificed, and learn to score the alternatives that way. This is entirely feasible since this data set is small, and because you have rich item attribute data.

How to decide whether to use a RDBMS, Doc/Obj ODBMS or Graph?

What I intend to design basically boils down to a list of users, organisations, events, addresses and comments which could quite easily be maintained in a RDBMS such as MySQL. However, if the project takes off I want to add another aspect which is resources - i.e. files, videos, images etc which can belong to either a user, organisation or event. This instantly raises the question of whether to use a RDBMS and store a reference to an external file through a table related to each of the categories previously mentioned or whether to use a Doc/Obj ODBMS such as MongoDB to store these items.
But I also want to be able to link users, organisations and events. i.e. User A owns Org 1 and Org 2. User B owns Org 3 and Org 4. User C owns Org 5. Org 1 has an Event X, held at Addr M on Date R, which Org 3 will also be at. User C intends to attend Event X. Org 2 also has an Event Y at Addr M but on Date T. etc etc. As such, I would suspect that a Graph DBMS such as OrientDB would be the best solution. Either that, or I would have a lot of tables in a RDBMS with a lot of joins, and potentially a lot of queries, or a very strange structure in a Doc/Obj DBMS.
I've looked at InfoGrid, which is a Graph database that can connect to MySQL, which could be a potential way to skin this cat. Has anybody else attempted anything like this? What are your thoughts on how to implement such a system, which needs to be scalable? Suggestions are greatly appreciated.
Your description lends itself to a relational model. RDBMS for this particular setup is the proper way to go.