aws-personalize: can I get recommendations on items not seen in training based on item features? - recommendation-engine

I am considering using AWS Personalize, or a similar managed recommendation service.
My question is whether it is possible to get recommendations/rankings for items that were not seen in the training data, based on item features. I see that AWS Personalize does have an item features dataset, but the documentation for the ranking recipe specifically says that items not in the training data are added at the end of any ranking. Of course, new items have no interaction data, so any recipe/algorithm that relies solely on interaction data is not relevant for my case.
So: can AWS Personalize be applied to my use case at all, and if so, how? Or do you know of another recommender service that can handle it?

Yes. There are specific Amazon Personalize recipes designed to support cold-starting items, where a cold item is one without behavioral data in the interactions dataset but with item metadata in the items dataset.
The User-Personalization recipe supports cold starting items through a feature called exploration. You control how much exploration (i.e., recommending cold items) is done with the explorationWeight inference hyperparameter when creating a Personalize campaign or batch inference job. See this blog post for details.
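For concreteness, a minimal boto3 sketch of enabling exploration at campaign creation time might look like the following; the ARN and the specific values are placeholder assumptions, not something from the original answer.

```python
# Sketch: turning on exploration for cold items when creating a campaign.
# The ARN and the numeric values are placeholders.
import boto3

personalize = boto3.client("personalize")

response = personalize.create_campaign(
    name="user-personalization-campaign",
    solutionVersionArn="arn:aws:personalize:us-east-1:123456789012:solution/my-solution/version-id",
    minProvisionedTPS=1,
    campaignConfig={
        "itemExplorationConfig": {
            # 0 = no exploration of cold items, 1 = mostly cold items.
            "explorationWeight": "0.3",
            # Age cutoff (in days) for items considered "cold" enough to explore.
            "explorationItemAgeCutOff": "30",
        }
    },
)
print(response["campaignArn"])
```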
Exploration also applies to domain recommenders: the Top picks for you VOD recommender and the Recommended for you e-commerce recommender. You specify the explorationWeight when creating a recommender.
The Similar-Items recipe supports the related-items use case and aims to balance recommending similar items based on behavioral data with thematic similarity between items. You currently cannot control the weighting with this recipe, though. See this blog post for details. The More like X VOD recommender provides similar functionality.

Related

How to prioritize the keywords in recommendation engine

I am building a recommendation engine for food products.
To recommend the best products matching a user's preferences, I would like to improve the engine with prioritization algorithms/methods.
For example, when a user sets his preferences to 'Organic' and 'Nut Free', the recommendation engine generates a list of products based on those preference keywords. So far I have made this work based on TF-IDF distance and collocation.
However, I have no idea how I can weight the preferences to get more appropriate results.
What is the best way to prioritize the keywords inside the recommendation engine?
In short, in this example I want the results to favor 'Organic' over 'Nut Free'.
Cheers
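As an illustration of one possible approach, each preference keyword's TF-IDF similarity can be multiplied by a priority weight before the scores are combined; the product descriptions and weights below are made-up assumptions, and the sketch uses scikit-learn rather than any particular engine from the question.

```python
# Sketch: weight each preference keyword's TF-IDF similarity before combining.
# The product descriptions and the weights are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

products = [
    "organic granola, nut free",
    "nut free chocolate bar",
    "organic peanut butter",
]

# Higher weight = higher priority ('Organic' counts more than 'Nut Free').
preference_weights = {"organic": 2.0, "nut free": 1.0}

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
product_matrix = vectorizer.fit_transform(products)

scores = [0.0] * len(products)
for keyword, weight in preference_weights.items():
    keyword_vector = vectorizer.transform([keyword])
    similarities = cosine_similarity(keyword_vector, product_matrix)[0]
    scores = [s + weight * sim for s, sim in zip(scores, similarities)]

# Rank products by the combined, weighted similarity.
for name, score in sorted(zip(products, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:.3f}  {name}")
```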

Recommending items with availability date in Amazon Personalize

Amazon Personalize builds a recommendation model taking into account users, items, and events. However, items are assumed to be available, and this might not be the case in certain scenarios.
If items have an availability window (from date, to date), then you should only be able to offer items that are valid according to that restriction.
For instance, this would be the case for live shows: you should only recommend live shows that will happen in the future, whether based on similarity or community behaviour. Live shows that have already happened are part of the training data, but are not valid products to recommend.
How can you model this availability restriction in Amazon Personalize?
There are a variety of business requirements that you may need to handle that are not built into the core of Amazon Personalize, and this is one of them. For these business requirements you need to build the logic into your wrapper around Amazon Personalize.
Edit: Personalize now allows you to filter recommendations on item metadata, which looks like it would be sufficient for this use case. See the write-up here: https://aws.amazon.com/blogs/machine-learning/enhancing-recommendation-filters-by-filtering-on-item-metadata-with-amazon-personalize/
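A rough boto3 sketch of such a metadata filter, assuming the Items dataset carries a custom AVAILABLE field (the field name and the ARNs are assumptions):

```python
# Sketch: define a filter that keeps only items currently flagged as available,
# then apply it at inference time. ARNs and the AVAILABLE field are placeholders.
import boto3

personalize = boto3.client("personalize")

filter_response = personalize.create_filter(
    name="available-items-only",
    datasetGroupArn="arn:aws:personalize:us-east-1:123456789012:dataset-group/my-dataset-group",
    filterExpression='INCLUDE ItemID WHERE Items.AVAILABLE IN ("true")',
)
filter_arn = filter_response["filterArn"]

# Pass the filter ARN when requesting recommendations so unavailable items are dropped.
runtime = boto3.client("personalize-runtime")
recs = runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/my-campaign",
    userId="user-42",
    filterArn=filter_arn,
)
print([item["itemId"] for item in recs["itemList"]])
```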

Can Continuous Views be reinitialized efficiently?

I'm new to PipelineDB and have yet to even experience it at runtime (installation pending ...). But I'm reading over the documentation and I'm totally intrigued.
Apparently, PipelineDB is able to take set-based query representations and mechanically transform them into an incremental representation for efficiently processing the streams of deltas with storage limited as a function of the output of the continuous view.
Is it also supported to run the underlying set-based query in the traditional, set-based way to prime a continuous view? It seems to me that upon creation of a continuous view the initial data would be computed in that traditional way. Also, since continuous views can be truncated, can they then be repopulated (from still-available source tables) without tearing down whatever dependent objects they have, as a drop/create would require?
It seems to me that this feature would be critical in many practical scenarios. One easy example would be refreshing occasionally to reset the drift from rounding errors in, say, fractional averages.
Another example is if a bug were discovered and fixed in PipelineDB itself that had caused errors in the data. After the software is patched, the queries based on data that is still available ought to be rerun.
Continuous views based entirely on event streams with no permanent storage could not be rebuilt in that way. I'm not sure about the case where only some of the join sources are ephemeral.
I don't see these topics covered in the docs. Can you explain how these are or aren't a concern?
Thanks!
Jeff from PipelineDB here.
The main answer to your question is covered in the introduction section of the PipelineDB technical docs:
"PipelineDB can dramatically reduce the amount of information that needs to be persisted to disk because only the output of continuous queries is stored. Raw data is discarded once it has been read by the continuous queries that need to read it."
While continuous views only store the output of continuous queries, almost everybody who is using PipelineDB is storing their raw data somewhere cheap like S3. PipelineDB is meant to be the realtime analytics layer that powers things like realtime reporting applications and realtime monitoring & alerting systems, used almost always in conjunction with other systems for data infrastructure.
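To make that concrete, here is a rough Python/psycopg2 sketch of rebuilding a truncated continuous view by replaying archived raw events back into its stream; the stream, view, column names, and archived rows are assumptions, and the exact truncate syntax depends on your PipelineDB version.

```python
# Sketch: repopulate a continuous view from raw events archived outside PipelineDB.
# "clicks_stream", "click_counts", and the archived rows are made-up examples.
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=pipeline")
cur = conn.cursor()

# 1. Empty the continuous view (check the exact syntax for your PipelineDB version).
cur.execute("TRUNCATE CONTINUOUS VIEW click_counts")

# 2. Replay the archived raw events into the stream the view reads from; the
#    continuous query re-aggregates them just as it did the first time around.
archived_events = [            # in practice, loaded back from S3 or an archive table
    ("user-1", "/home", "2016-05-01 12:00:00"),
    ("user-2", "/pricing", "2016-05-01 12:00:05"),
]
for user_id, url, ts in archived_events:
    cur.execute(
        "INSERT INTO clicks_stream (user_id, url, ts) VALUES (%s, %s, %s)",
        (user_id, url, ts),
    )

conn.commit()
cur.close()
conn.close()
```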
If you're interested in PipelineDB you might also want to check out the new realtime analytics API product we recently rolled out, called Stride. The Stride API gives developers the benefit of continuous SQL queries, integrated storage, windowed queries, and other things like realtime webhooks, without having to manage any underlying data infrastructure, all via a simple HTTP API.
If you have any additional technical questions you can always find our open-source users and dev team hanging out in our gitter chat channel.

MongoDB + Neo4J vs OrientDB vs ArangoDB [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I am currently in the design phase of an MMO browser game; the game will include tile maps for some real-time locations (so tile data for each cell) and a general world map. The game engine I prefer uses MongoDB for the persistent data world.
I will also implement a shipping simulation (which I explain more below), which is basically a Dijkstra module. I had decided to use a graph database, hoping it would make things easier, and found Neo4j, as it is quite popular.
I was happy with the MongoDB + Neo4j setup, but then noticed OrientDB, which apparently acts like both MongoDB and Neo4j (best of both worlds?); they even have comparison pages for MongoDB and Neo4j.
The point is, I have heard some horror stories of MongoDB losing data (though I'm not sure whether it still does), and I don't have that luxury. As for Neo4j, I am not a big fan of the €12K-per-year "startup friendly" cost, although I'll probably not have a DB of millions of vertices. OrientDB seems a viable option, as there may also be some opportunities to use a single database solution.
In that case, a logical move might be jumping to OrientDB, but it has a small community and, to be honest, I didn't find many reviews about it. MongoDB and Neo4j are popular, widely used tools, so I have concerns about whether OrientDB would be an adventure.
My first question would be whether you have any experience or opinion regarding these databases.
My second question would be which graph database is better for a shipping simulation. The database is expected to calculate the cheapest route from any vertex to any vertex and traverse it (classic Dijkstra), but also to change weights depending on situations like "country B has an embargo on country A, so any item originating from country A can't pass through B" or "there is a flood at region XYZ, so no land transport is possible". The database is also expected to cache results. I expect no more than 1000 vertices but many edges.
Thanks in advance, and apologies if the questions are a bit ambiguous.
PS: I added ArangoDB to the title but, to be honest, haven't had much chance to take a look at it.
Late edit as of 18-Apr-2016: After evaluating the responses to my questions and my development strategy, I decided to use ArangoDB, as their roadmap is more promising for me; they are apparently not trying to add tons of half-baked hype features.
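For what it's worth, the dynamic-weighting requirement described in the question (embargoes, flooded regions) can be sketched independently of any particular database; the graph, costs, and rules below are made-up assumptions, using networkx in Python.

```python
# Sketch: Dijkstra with situational edge weights (embargoes, flooded regions).
# The graph, costs, and rules are illustrative assumptions only.
import networkx as nx

G = nx.DiGraph()
G.add_edge("A", "B", cost=5, land=True)   # direct land route, subject to embargo below
G.add_edge("A", "C", cost=9, land=False)  # sea route
G.add_edge("B", "D", cost=3, land=True)
G.add_edge("C", "D", cost=4, land=False)

embargoes = {("A", "B")}  # country B embargoes goods originating from country A
flooded = set()           # e.g. {"C"} would block land transport touching region C

def weight(u, v, data):
    """Return the edge cost, or None to hide the edge from the search."""
    if (u, v) in embargoes:
        return None
    if data["land"] and (u in flooded or v in flooded):
        return None
    return data["cost"]

# networkx accepts a callable weight; returning None removes the edge from consideration.
path = nx.dijkstra_path(G, "A", "D", weight=weight)
print(path)  # ['A', 'C', 'D'] once the cheaper A -> B -> D route is embargoed
```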
Disclaimer: I am the author and owner of OrientDB.
As a developer, in general I don't like companies that hide costs, let you play with their technology for a while, and then start asking for money as soon as you're tied to it. Once you have invested months developing an application that uses a non-standard language or API, you're stuck: pay, or migrate the application at huge cost.
OrientDB, you know, is FREE for any usage, even commercial. Furthermore, OrientDB supports standards like SQL (with extensions), and its main Java API is TinkerPop Blueprints, the "JDBC" standard for graph databases. OrientDB also supports Gremlin.
The OrientDB project is growing every day, with new contributors and users. The Community Group (a free channel for support) is the most active community in the graph database market.
If you have doubts about which graph database to use, my suggestion is to pick the one closest to your needs, but then use standards as much as you can. That way, an eventual switch would have a low impact.
It sounds as if your use case is exactly what ArangoDB is designed for: you seem to need different data models (documents and graphs) in the same application and might even want to mix them in a single query. This is where a multi-model database like ArangoDB shines.
If MongoDB has served you well so far, then you will immediately feel comfortable with ArangoDB, since it is very similar in look and feel. Additionally, you can model graphs by storing your vertices in one (or multiple) collections and your edges in one or more so-called "edge collections". This means that individual edges are simply documents in their own right and can hold arbitrary JSON data. The database then offers traversals, customizable with JavaScript, to match any needs you might have.
For your query variations, you could, for example, add attributes about these embargoes to your vertices and program the queries/traversals to take them into account.
The ArangoDB database is licensed under the Apache 2 license, and community as well as professional support is readily available.
If you have any more specific questions, do not hesitate to ask in the Google group (https://groups.google.com/forum/#!forum/arangodb) or contact hackers (at) arangodb.org directly.
Neo4j's pricing is actually quite flexible, so don't be put off by the prices on the website.
You can also get started with the community edition or personal edition for a long time.
The Neo4j community is very active and helpful and quickly provides support and help for your questions. I think that's the biggest plus, besides the performance and convenience of using a graph model in general.
Regarding your use-case:
Neo4j is used for exactly this route-calculation scenario by one of the largest logistics companies in the world, where it routes up to 4000 packages per second across the country.
It is also used in other game engines, for example at GameSys for game-economy simulation, and in another one for routing (not in earth coordinates but in game-world coordinates, using Neo4j Spatial).
I'm curious why you have so few nodes. Are they something like transport portals? And I wonder where you store the details and dynamics of the routes (like the criteria you mentioned); do they come from the outside, i.e. the in-memory state of the game engine?
You should probably share some more details about your model and the concrete use-case.
And it might help to know that both Emil, one of the founders of Neo4j and I are old time players of multi user dungeons (MUDs), so it is definitely a use-case close to our heart :)

Recommendation engine using google-prediction-api?

Google's Prediction API page says it can be used to recommend web pages / products...
Can someone please show me how? For example:
I have 500,000 members' purchase history.
I have 2,000,000 products in 200 different categories.
A user, user-X, has just signed up, and I asked him 15 'like'/'dislike' product questions (to capture the user's taste).
Now, I want to suggest/recommend to user-X a list (e.g., 500) of products he is most likely willing to purchase.
Thanks a lot
If you are not specifically tied to the Google API for whatever reason, explore using Mahout. This is a basic use case for Mahout's recommendation mining.
https://cwiki.apache.org/MAHOUT/itembased-collaborative-filtering.html
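As a rough illustration of what item-based collaborative filtering does (using plain numpy rather than Mahout; the tiny ratings matrix is made up):

```python
# Sketch: item-based collaborative filtering on a toy user x product matrix.
# Rows = users, columns = products; 1 = purchased/liked, 0 = no interaction.
import numpy as np

ratings = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
], dtype=float)

# Cosine similarity between product columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)
np.fill_diagonal(item_sim, 0.0)

# user-X answered like/dislike questions for products 0 and 1 (+1 like, -1 dislike);
# products 2 and 3 are unseen and are the candidates to recommend.
user_x = np.array([1.0, -1.0, 0.0, 0.0])
scores = item_sim @ user_x

candidates = np.where(user_x == 0)[0]
ranked = candidates[np.argsort(-scores[candidates])]
print(ranked)  # product indices, best recommendation first
```

At the scale in the question (500,000 users and 2,000,000 products) the same idea would need sparse matrices and approximate nearest-neighbour search rather than a dense similarity matrix.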
The Google Prediction API, as currently implemented, is great for classifying data into a discrete set of categories; however, as noted in the documentation:
"Avoid having a high ratio of categories to training data in categorical models. Try to have at least a few dozen examples for each category, minimum. For really good predictions, a few hundred examples per category is recommended."
The Prediction API's classification doesn't work well when the ratio of categories to examples is high, and in the example you sketched the relationship is one-to-one, because you are trying to find the user whose liked-product list is most similar to that of the user of interest (in order to find a set of promising products to recommend). In this model, each user is a unique category.