The format of my input file is the following:
PERSON1 BUILDING1
PERSON2 BUILDING4
PERSON3 BUILDING4
PERSON5 BUILDING3
PERSON3 BUILDING2
PERSON3 BUILDING1
PERSON5 BUILDING6
PERSON4 BUILDING6
1000 more rows like this
Each row should be read as "person X visited building Y".
I simply want to have clusters like this:
Cluster 1 : Persons that visited only 1 building (the same building)
Cluster 2 : Persons that visited only 2 buildings (the same buildings, let's say building 1 & 2)
Cluster 3 : Persons that visited only 2 buildings (the same buildings, let's say building 3 & 4)
Cluster 4 : Persons that visited only 3 buildings (the same buildings)
etc..
What would be the best way to do it? Is there software, ideally with data visualization, that can do that? I tried KNIME with no success.
You need to reformat your data appropriately.
Then use a group_by operation based on the set of buildings visited.
This is much simpler than clustering.
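A minimal sketch of that group-by in plain Python, using the eight sample rows from the question. The function name `group_by_building_set` is made up for illustration; the idea is only that each person is keyed by the frozen set of buildings they visited.

```python
from collections import defaultdict

# The sample rows from the question, one "PERSON BUILDING" pair per line.
ROWS = """PERSON1 BUILDING1
PERSON2 BUILDING4
PERSON3 BUILDING4
PERSON5 BUILDING3
PERSON3 BUILDING2
PERSON3 BUILDING1
PERSON5 BUILDING6
PERSON4 BUILDING6"""


def group_by_building_set(lines):
    """Return {frozenset of buildings: [persons who visited exactly that set]}."""
    # Step 1: reformat the data into one set of buildings per person.
    visited = defaultdict(set)
    for line in lines:
        person, building = line.split()
        visited[person].add(building)

    # Step 2: group persons whose building sets are identical.
    groups = defaultdict(list)
    for person, buildings in visited.items():
        groups[frozenset(buildings)].append(person)
    return groups


groups = group_by_building_set(ROWS.splitlines())
for buildings, persons in sorted(groups.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(len(buildings), sorted(buildings), sorted(persons))
```

With only these eight rows every person ends up alone in a group; with the full file, persons sharing exactly the same set of buildings land in the same list.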
I second @Anony-Mousse: the solution is closer to a "group by" than to clustering. To prove that it works, I built a simple workflow in KNIME that gets the expected result. For the visualization part you mention, a correspondence analysis could be useful.
This chart is implemented in R (you can use the R node) and shows how related one entity (say visitors, in blue) is to another entity (say buildings, in red). Of course, the proper chart depends on your full data and intentions.
I have a forest data structure which looks like below :
I am grouping products 1 and 2 because person 2 owns both.
Similarly, I am adding product 3 to this group because person 3 owns both product 2 and product 3.
Person 4 and product 4 belong to a different group, as they do not share any products with any other person.
Question
Currently my input dataset looks like below:
And I want my output dataset to look as below:
I am trying to do this with SQL, and I even achieved the desired result when I process the dataset as a whole, but the problem is that the group IDs are transient: they do not stay the same from one run to the next.
If tomorrow an incremental dataset comes in with product5 owned jointly by person3 and person5, I want group1 in the target output table to automatically pick up the new person5.
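One way to think about this, outside SQL, is as connected components with persistent IDs. The sketch below is a small union-find in Python in which a merged group always keeps the older of the two IDs, so an incremental batch extends existing groups instead of renumbering them. The class `OwnershipGroups` and the ownership rows are assumptions: the original input and output tables are not shown, so the rows are reconstructed from the description above, and person 1 owning product 1 is a guess.

```python
import itertools


class OwnershipGroups:
    """Union-find over person and product nodes with persistent group IDs."""

    def __init__(self):
        self.parent = {}                  # node -> parent node
        self.group_id = {}                # root node -> group ID
        self._ids = itertools.count(1)    # IDs are opaque; gaps are expected

    def _find(self, node):
        if node not in self.parent:       # first time we see this node
            self.parent[node] = node
            self.group_id[node] = next(self._ids)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def add(self, person, product):
        a = self._find(("person", person))
        b = self._find(("product", product))
        if a == b:
            return
        # Keep the older (smaller) ID so existing groups keep their identity.
        keep = min(self.group_id.pop(a), self.group_id.pop(b))
        self.parent[b] = a
        self.group_id[a] = keep

    def group_of(self, person):
        return self.group_id[self._find(("person", person))]


groups = OwnershipGroups()
# Assumed initial load: (person, product) ownership rows.
for person, product in [(1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 4)]:
    groups.add(person, product)
first_group = groups.group_of(3)

# Incremental batch: product 5 is owned by person 3 and person 5.
for person, product in [(3, 5), (5, 5)]:
    groups.add(person, product)

print(groups.group_of(5) == first_group)  # person 5 joined the existing group
```

In a database setting the `parent` and `group_id` maps would have to be stored between loads; how to do that in SQL is not covered by this sketch.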
I'm not sure the title is the best way to phrase it, here's the structure:
Structure
Here's the db json backup if you want to import it to test it: http://pastebin.com/iw2d3uuy
I'd like to get the Dishes eaten by the Humans living in Continent 1 until a _Parent Human moved to Continent 2.
Which means the target is Dish 1 & 2.
If a parent moved to another Continent, I don't want their dish nor the dishes of their children, even if they move back to Continent 1.
I don't know if it matters, but a Human can have multiple children.
If there wasn't the condition about the children of a Human who has moved from the Continent, this query would have worked:
SELECT expand(in('_Is_in').in('_Lives').in('_Eaten_by'))
FROM Continent WHERE continent_id = 1
But I guess here we're forced to use (among other things)
TRAVERSE out('_Parent') FROM Human WHILE
I've tried to use the WHILE of TRAVERSE with a subquery to get all the Humans I'm interested in before trying to get the Dishes, but I'm not even sure we can use WHILE with a subquery.
I hope the structure will help other users to quickly find out if this query is useful to them. If anyone is wondering, I used the Graph tab of OrientDB Studio to make it, along with GIMP.
As a bonus, if anyone knows the Gremlin syntax, it would also be useful to learn it.
Please feel free to edit this post as you see fit and contribute your thoughts :)
SELECT expand(in('_Eaten_by'))
FROM (
  TRAVERSE out('_Parent')
  FROM (SELECT FROM Human WHERE in('_Parent').size() = 0)
  WHILE out('_Lives').out('_Is_in').continent_id = 1
)
Explanation:
TRAVERSE out('_Parent')
FROM (SELECT FROM Human WHERE in('_Parent').size() = 0)
WHILE out('_Lives').out('_Is_in').continent_id = 1
returns Human 1 and 2.
That query traverses the Human vertices, starting from Human 1, for as long as each Human is connected to Continent 1.
It starts from in('_Parent').size() = 0 which are the Humans without any _Parent (there's only Human 1 in this case) (size() is the size of the collection of vertices coming in from _Parent).
And SELECT expand(in('_Eaten_by')) FROM
gets the Dishes, starting from the Humans we got from the traversal and going through the edge _Eaten_by.
Note: be sure to always put single quotes (') around vertex and edge names; otherwise the names don't seem to be taken into account.
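For readers who find the WHILE semantics easier to follow in code, here is a plain-Python analogue of the traversal (not OrientDB code). The graph is hypothetical, since the original structure diagram is not reproduced here: it is just a chain of four Humans in which Human 3 has moved to Continent 2. It shows that a Human failing the condition is neither returned nor traversed further, so the descendants are cut off even if they live on Continent 1 again.

```python
from collections import deque

# Hypothetical data: "children" plays the role of out('_Parent'),
# "continent" the role of out('_Lives').out('_Is_in').continent_id,
# and "dishes" the role of in('_Eaten_by').
humans = {
    "Human 1": {"continent": 1, "children": ["Human 2"], "dishes": ["Dish 1"]},
    "Human 2": {"continent": 1, "children": ["Human 3"], "dishes": ["Dish 2"]},
    "Human 3": {"continent": 2, "children": ["Human 4"], "dishes": ["Dish 3"]},
    "Human 4": {"continent": 1, "children": [], "dishes": ["Dish 4"]},
}


def traverse_while_on_continent(humans, continent_id):
    """Walk down from the parentless Humans, stopping at anyone off the continent."""
    has_parent = {child for h in humans.values() for child in h["children"]}
    roots = [name for name in humans if name not in has_parent]

    kept, seen, queue = [], set(), deque(roots)
    while queue:
        name = queue.popleft()
        if name in seen or humans[name]["continent"] != continent_id:
            continue  # condition failed: skip this Human and never reach its children
        seen.add(name)
        kept.append(name)
        queue.extend(humans[name]["children"])
    return kept


kept = traverse_while_on_continent(humans, 1)
dishes = [dish for name in kept for dish in humans[name]["dishes"]]
print(kept)    # ['Human 1', 'Human 2']
print(dishes)  # ['Dish 1', 'Dish 2']
```

The sketch visits breadth-first for simplicity; the visiting order in OrientDB may differ, but the set of Humans that pass the condition is the same.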
I've been looking for an answer on the web for quite a long time, but I couldn't find one. So I hope Stack Overflow users can help or advise me a bit.
I have 7 000 addresses (like "67, place Lobligeois 75017 Paris, France") and I would like to get a Shapefile that contains the 7 000 buildings corresponding to these 7 000 addresses.
My idea is to:
1. Use the MapQuest API to get the "OSM node" for these 7 000 addresses
2. Use the Overpass API to get, for all buildings in Paris, their "ways" and "nodes"
3. Match (1) and (2) to get the "ways" corresponding to my 7 000 "nodes"/addresses
4. Load in QGIS a shapefile (found at download.bbbike.org/osm/bbbike/Paris/) of all Paris buildings (a shapefile where "OSM_ID" equals "way")
5. Find in my shapefile the "ways" obtained in (3) and delete all buildings that do not match.
Is it a good idea? Or is there a simpler way to do it (I hope)?
By the way, I am not able to download the data from my step 2: overpass-turbo.eu fails each time. Do you have any idea why (is my bounding box too big)?
I would be delighted to get some advice or help.
Charles H.
Try to use this: https://github.com/kiselev-dv/gazetteer/tree/develop/Gazetteer
You can get a CSV with addresses, address components, OSM IDs, and geometry as a WKT string.
After that, you can compare it with the points from step one, by OSM ID or by address, and keep only the CSV rows you need.
Finally, open the CSV in QGIS and save it as a shapefile.
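The filtering step can be sketched in a few lines of Python. The column names (`osm_id`, `address`, `wkt`) and the sample rows are assumptions for illustration only; the actual column layout of the Gazetteer export is not described here and should be checked against the real file.

```python
import csv
import io


def filter_by_osm_id(csv_text, wanted_ids, id_column="osm_id"):
    """Keep only the rows whose ID column is in wanted_ids."""
    wanted = set(wanted_ids)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[id_column] in wanted]


# Made-up sample; WKT contains commas, so those fields are quoted.
sample = (
    "osm_id,address,wkt\n"
    'w1,"1 Example Street","POLYGON((0 0,0 1,1 1,0 0))"\n'
    'w2,"2 Example Street","POLYGON((2 2,2 3,3 3,2 2))"\n'
    'w3,"3 Example Street","POLYGON((4 4,4 5,5 5,4 4))"\n'
)

# IDs obtained from geocoding the address list (hypothetical values).
kept = filter_by_osm_id(sample, ["w1", "w3"])

# Write the surviving rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["osm_id", "address", "wkt"])
writer.writeheader()
writer.writerows(kept)
print(out.getvalue())
```

For a real file, replace the `io.StringIO` objects with `open(...)` calls using `newline=""`, as the `csv` module documentation recommends.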
There are a couple of things I recommend.
Don't bother trying to extract the buildings. That will put a big hurting on your browser. Instead, grab one of the Geofabrik daily extracts for the Paris region. While those won't include the address nodes, they will have all the buildings.
Next, do an Overpass query for just the addresses on nodes, using the NominatimArea function. It looks like there are 30MB worth of them in Paris (!!), so you may have to break that area down into smaller districts, if Paris has any. Export the result as GeoJSON and convert it to a shapefile.
What do we have as of now? - We are using Mahout's GenericItemBasedRecommender to get a list of recommended products for a user using TanimotoCoefficientSimilarity as ItemSimilarity.
Where do we want to go from here? - The above works fine when we don't care about product category, but what we want are product-category-specific recommendations. For example, if a user has been buying, browsing, liking, etc. mostly in the Men's and Gadgets categories, I would want to show this user recommendations in those specific categories under "Recommended for you in [X]", where X would be replaced by Men's or Gadgets in this case. We are thinking about the options below to achieve this, and we need some leads, opinions, and feedback to make sure we are going in the right direction. Options:
Firstly, we'll have to move to a non-Tanimoto measure for calculating item similarity, so that we account for users buying, liking, etc., and not only for view/browsing data.
Figuring out the product categories for a particular user (this is where we need direction) - Our product category hierarchy is basically a tree, and we need to know which top 4 nodes (with the best recommendations) in the tree we would show to the user. Also, if node X is a category that we are showing to the user and node Y is a parent of node X, we don't want to show the user products in category Y, or in any ancestor for that matter. A couple of ways of achieving this:
For every user, calculate the SUM of the similarity scores of the recommended items at each leaf node, and recursively calculate it for each parent node up to the root. At each node we then have A = SUM of similarity scores and B = number of items recommended, so we also have V = A/B at each node. We pick the top 4 V values from the tree and recommend those categories to the user. The challenge here is that if we try to calculate this online during the request, it would be tough to keep the entire request under 150 ms. An example:
Root Level -            Category12 (A=11, B=4) (category1 + category2)
                                     |
                _____________________|_____________________
               /                                           \
              /                                             \
Leaf Level - category1 (A=6, B=2)                 category2 (A=5, B=2)
Recommended products in Category 1: Item1 (score = 2), Item2 (score = 4)
Recommended products in Category 2: Item3 (score = 1), Item4 (score = 4)
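The bottom-up aggregation in this example can be sketched as follows. This is only an illustration in Python (not Mahout code), with a made-up `Category` class; it reproduces the numbers of the example tree and leaves out the rule about not showing ancestors of an already selected category.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    scores: list = field(default_factory=list)    # scores of items recommended directly here
    children: list = field(default_factory=list)  # sub-categories


def aggregate(node, stats):
    """Post-order walk: record (A, B, V) per node, where V = A / B."""
    a = sum(node.scores)
    b = len(node.scores)
    for child in node.children:
        child_a, child_b = aggregate(child, stats)
        a += child_a
        b += child_b
    stats[node.name] = (a, b, a / b if b else 0.0)
    return a, b


root = Category("Category12", children=[
    Category("category1", scores=[2, 4]),  # Item1, Item2
    Category("category2", scores=[1, 4]),  # Item3, Item4
])

stats = {}
aggregate(root, stats)
ranked = sorted(stats, key=lambda name: stats[name][2], reverse=True)

print(stats["Category12"])  # (11, 4, 2.75)
print(ranked)               # ['category1', 'Category12', 'category2']
```

Each node is visited once, so the cost is linear in the size of the category tree plus the number of recommended items, which is one input to deciding whether it fits in the request budget or has to be precomputed offline.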
Second option: For every category create a cluster of users based on their behaviour (likes, buying, viewing etc.) and then figure out the top 4 categories to which the user belongs. Not sure if we can achieve this using clustering in Mahout but I think we can do this offline.
Please provide your feedback/suggestions/leads/thoughts.
Thanks in advance!
If you want to model more than one kind of signal in your data, I would suggest using the SVD recommender instead, with the ALSWR factorizer set to implicit feedback. With that done, you can have user,item,preference in your data, where the preference value expresses how strongly the user is associated with the item. You can play with the numbers: for example, a purchase is a 20 and a view is just a 2. I'm just throwing numbers out here; I wouldn't know what will work best for your data. You can also model things proportionally: if a purchase is 30 times less likely to happen than a view, then a purchase should be 30 times stronger than a view.
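As a rough illustration of turning raw events into user,item,preference rows, here is a small Python sketch. The weights are the illustrative 20 and 2 from above; the event log is made up, and summing the weights of repeated events is an assumption, not something Mahout prescribes.

```python
from collections import defaultdict

# Illustrative weights only; tune them for your own data.
EVENT_WEIGHTS = {"purchase": 20.0, "view": 2.0}


def to_preferences(events):
    """Collapse (user, item, event_type) rows into (user, item, preference) rows."""
    prefs = defaultdict(float)
    for user, item, event_type in events:
        prefs[(user, item)] += EVENT_WEIGHTS[event_type]  # assumption: weights add up
    return [(user, item, value) for (user, item), value in sorted(prefs.items())]


# Hypothetical event log: user ID, item ID, event type.
events = [
    (1, 101, "view"),
    (1, 101, "view"),
    (1, 101, "purchase"),
    (1, 102, "view"),
    (2, 101, "view"),
]

preferences = to_preferences(events)
for user, item, value in preferences:
    print(f"{user},{item},{value}")
```

The printed lines have the user,item,preference shape described above and could be written to a file for the recommender to load.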
Mahout provides a way to influence the recommendations through the IDRescorer. You implement your own logic here and decide how to affect the recommendations. For example, the IDRescorer would check if a recommendation candidate belongs to the same category and if it does, boost the score by X. There's an example here (link) from the Mahout in Action Book (which you should definitely read), showing a rescorer.
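The rescoring idea can be illustrated independently of Mahout. The sketch below is plain Python, not Mahout's Java API: the class `CategoryBoostRescorer`, the item-to-category map, and the candidate scores are all hypothetical, and "boost by X" is interpreted here as multiplying the score by a factor.

```python
class CategoryBoostRescorer:
    """Boost candidates that belong to the category the user is browsing."""

    def __init__(self, item_category, target_category, boost):
        self.item_category = item_category      # item ID -> category name
        self.target_category = target_category
        self.boost = boost                      # multiplicative factor (assumption)

    def rescore(self, item_id, original_score):
        if self.item_category.get(item_id) == self.target_category:
            return original_score * self.boost
        return original_score

    def is_filtered(self, item_id):
        return False  # nothing is excluded outright in this sketch


# Hypothetical catalogue and recommender output: (item ID, score).
item_category = {101: "Mens", 102: "Gadgets", 103: "Gadgets"}
candidates = [(101, 3.0), (102, 2.5), (103, 2.0)]

rescorer = CategoryBoostRescorer(item_category, "Gadgets", boost=1.5)
rescored = [
    (item, rescorer.rescore(item, score))
    for item, score in candidates
    if not rescorer.is_filtered(item)
]
rescored.sort(key=lambda pair: pair[1], reverse=True)
print(rescored)  # [(102, 3.75), (101, 3.0), (103, 3.0)]
```

Returning `True` from `is_filtered` for items outside the category would turn the boost into a hard filter, which is closer to a strict "Recommended for you in [X]" list.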
Hope this helps
I have a use case in which I have to distribute a set of objects (let's call them Food objects) between two objects (say Person objects) while satisfying certain conditions (say each Person has a minimum energy requirement, and each Food object provides a defined amount of energy). I would write rules for Person A and Person B. Could someone tell me whether this can be achieved using Drools and, if so, how?
Assume I have the following domain objects:
Person :
    requirement
    List<Food>
Food :
    energy
Say I've added Person A, Person B, and a list of 10 Food objects to the knowledge base.
First answer the following question:
Can you take a food from the unassigned food list and always decide which Person it should go to, independently of how many other foods that or other persons have already been assigned?
If the answer is yes, use Drools Expert with rules like
rule "Assign each food to the person who likes it most"
when
    $f : Food(unassigned == true)
    FoodLike($p : person, foodLike == $f, $l : likeness)
    not FoodLike(foodLike == $f, likeness > $l)
then
    // assign $f to $p
end
If the answer is no, you have a bin packing problem, which is NP-complete. In that case use Drools Planner; see this video of a bin packing problem. So just copy and adapt that example (called cloudbalance), where the computers would be your persons and the processes would be your food objects.
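To get a feel for the problem before reaching for a solver, here is a simple largest-first greedy heuristic in Python. To be clear, this is not what Drools Planner does, and it is not guaranteed to find a valid distribution even when one exists; the requirements and energy values are made up.

```python
def greedy_distribute(requirements, energies):
    """Give each food, largest first, to the person with the biggest remaining deficit."""
    remaining = dict(requirements)               # person -> energy still needed
    assigned = {person: [] for person in requirements}
    for energy in sorted(energies, reverse=True):
        person = max(remaining, key=remaining.get)  # ties go to the first person
        assigned[person].append(energy)
        remaining[person] -= energy
    satisfied = all(left <= 0 for left in remaining.values())
    return assigned, satisfied


# Hypothetical numbers: minimum energy per person and energy per food.
requirements = {"A": 10, "B": 8}
energies = [6, 5, 4, 3, 2, 1]

assigned, satisfied = greedy_distribute(requirements, energies)
print(assigned)   # {'A': [6, 4, 2], 'B': [5, 3, 1]}
print(satisfied)  # True
```

With requirements of 10 and 8 and energies `[5, 4, 4, 3, 2]`, the same heuristic gives B only 7 although a valid split exists (5 + 3 + 2 and 4 + 4), which is the kind of case where a planner that searches over assignments is the better tool.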