How to assign a static non-changing root for a forest data structure - pyspark

I have a forest data structure that looks like the one below:
I am grouping the products 1,2 because person 2 owns both.
Similarly I am adding product 3 to this group because person 3 shares product 2 and 3
Person 4 and product 4 belong to a different group because they do not share any products with any other person.
Question
Currently my input dataset looks like below:
And I want my output dataset to look as below:
I am trying to do this with SQL, and I achieved the desired result when I process the dataset as a whole, but the problem is that the group IDs are transient (they are not stable from one run to the next).
If an incremental dataset comes in tomorrow with product5, owned jointly by person3 and person5, I want group1 in the target output table to pick up the new person5 automatically.
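To make the requirement concrete, here is a minimal sketch in plain Python (not PySpark or SQL) of one possible approach: a union-find over person and product nodes that reuses group IDs from an earlier run. The function name assign_groups, the known_groups lookup, and the rule "when two existing groups merge, keep the smaller ID" are all assumptions for illustration, and the sample ownership pairs are made up to match the description above.

```python
from collections import defaultdict


def assign_groups(edges, known_groups=None):
    """Return {node: group_id} for every person/product node.

    edges: iterable of (person, product) ownership pairs.
    known_groups: {node: group_id} saved from an earlier run (hypothetical
        persisted lookup); IDs found here are carried forward.
    """
    known_groups = dict(known_groups or {})
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Nodes that already shared a group must stay connected.
    anchor = {}
    for node, gid in known_groups.items():
        if gid in anchor:
            union(anchor[gid], node)
        else:
            anchor[gid] = node
            find(node)

    # Tag nodes so that person 2 and product 2 do not collide.
    for person, product in edges:
        union(("person", person), ("product", product))

    members = defaultdict(list)
    for node in list(parent):
        members[find(node)].append(node)

    next_id = max(known_groups.values(), default=0) + 1
    groups = {}
    for nodes in members.values():
        old_ids = [known_groups[n] for n in nodes if n in known_groups]
        if old_ids:
            gid = min(old_ids)  # assumed rule: smallest existing ID wins
        else:
            gid = next_id       # brand-new component gets a fresh ID
            next_id += 1
        for n in nodes:
            groups[n] = gid
    return groups


# First load, then an incremental load that reuses the earlier IDs.
first = assign_groups([(2, 1), (2, 2), (3, 2), (3, 3), (4, 4)])
second = assign_groups([(3, 5), (5, 5)], known_groups=first)
```

In the second call, person5 and product5 land in the group that person3 already belongs to, and every node from the first load keeps its ID.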

Related

Clustering with data visualization

The format of my input file is the following:
PERSON1 BUILDING1
PERSON2 BUILDING4
PERSON3 BUILDING4
PERSON5 BUILDING3
PERSON3 BUILDING2
PERSON3 BUILDING1
PERSON5 BUILDING6
PERSON4 BUILDING6
1000 more rows like this
Each row should be read like this "the person X visited building Y"
I simply want to have clusters like this:
Cluster 1 : Persons that visited only 1 building (the same building)
Cluster 2 : Persons that visited only 2 buildings (the same buildings, let's say building 1 & 2)
Cluster 3 : Persons that visited only 2 buildings (the same buildings, let's say building 3 & 4)
Cluster 4 : Persons that visited only 3 buildings (the same buildings)
etc..
What would be the best way to do it? Is there a software ideally with data visualization that can do that? I tried Knime with no success.
You need to reformat your data appropriately.
Then use a group_by operation based on the set of buildings visited.
This is much simpler than clustering.
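As a rough illustration of that idea, the sketch below reshapes the rows into one set of buildings per person and then groups persons whose sets are identical. It is plain Python rather than KNIME, and the function name group_by_buildings is made up for this example.

```python
from collections import defaultdict


def group_by_buildings(lines):
    """Group persons by the exact set of buildings they visited.

    lines: iterable of "PERSON BUILDING" strings, one visit per line.
    Returns {frozenset_of_buildings: [persons]}.
    """
    # Step 1: reformat to one set of buildings per person.
    visited = defaultdict(set)
    for line in lines:
        person, building = line.split()
        visited[person].add(building)

    # Step 2: group persons by that set.
    groups = defaultdict(list)
    for person, buildings in visited.items():
        groups[frozenset(buildings)].append(person)
    return dict(groups)


sample = [
    "PERSON1 BUILDING1",
    "PERSON2 BUILDING4",
    "PERSON3 BUILDING4",
    "PERSON5 BUILDING3",
    "PERSON3 BUILDING2",
    "PERSON3 BUILDING1",
    "PERSON5 BUILDING6",
    "PERSON4 BUILDING6",
]

for buildings, persons in group_by_buildings(sample).items():
    print(sorted(buildings), persons)
```

Each key of the result is one "cluster" in the sense of the question; the size of the key tells you how many buildings its members visited.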
I second #Anony-Mousse: the solution is closer to a "group by" than to clustering. To show that it works, I built a simple workflow in KNIME that produces the expected result. For the visualization part you mention, a correspondence analysis might be useful.
This chart is implemented in R (you can use the R node) and shows how strongly one entity (say, visitors in blue) is related to another entity (say, buildings in red). Of course, the proper chart depends on your full data and intentions.

Display Columns with a Value as List in FMP View

GIVEN:
A FMP database that has the following columns in 1 table:
Student
other data to be displayed
Test1_Grade
Test2_Grade
Test3_Grade
Test4_Grade
Test5_Grade
WHEN:
StudentA only gets grades for tests 1, 3, and 5 and
StudentB only gets grades for tests 1 and 4
HOW: would you display only the test fields that have a value as a list for each student in a view?
Ex:
StudentA
... other data to be displayed ...
Test1_Grade A
Test3_Grade B-
Test5_Grade B
... other data to be displayed ...
I would put the grades in a related table and use a portal to enter/show the relevant grades.
Then you could make a calculation field using the List function to retrieve the related values.
If that’s not an option you could make a calculation field using simple Case structures to include only data from the fields that are relevant or not empty and use that for display.
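The logic of that second option (keep only the fields that are not empty and join them into a list) can be sketched outside FileMaker. The snippet below is Python, not FileMaker calculation syntax, and the function name graded_tests is invented for the illustration.

```python
TEST_FIELDS = [
    "Test1_Grade",
    "Test2_Grade",
    "Test3_Grade",
    "Test4_Grade",
    "Test5_Grade",
]


def graded_tests(record):
    """Return one 'field grade' line per test field that has a value."""
    lines = []
    for field in TEST_FIELDS:
        grade = record.get(field)
        if grade:  # skip missing or empty grades
            lines.append(f"{field} {grade}")
    return "\n".join(lines)


student_a = {
    "Student": "StudentA",
    "Test1_Grade": "A",
    "Test2_Grade": "",
    "Test3_Grade": "B-",
    "Test5_Grade": "B",
}
print(graded_tests(student_a))
```

A calculation field built from Case structures would follow the same pattern: test each grade field for a value and append it to the result only when it has one.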

How to best structure csv data for tableau that has "multiple categories"?

I have a set of 100 “student records”. I want to have checkboxes for each "favorite_food_type" and "favorite_food"; whichever is checked would filter a "bar graph" that counts the number of records that contain that specific "favorite_food_type" and "favorite_food". The schema could be:
name
favorite_food_type (e.g. vegetable)
favorite_food (e.g. banana)
In the dashboard, I would like to be able to select via checkboxes “Give me the COUNT OF DISTINCT students with favorite_food of banana, apple, pear” and filter graphs for all records. My issue is that a single student may like both banana and apple. How do I best capture that? Should I have:
CASE A: Duplicate Records (this captures the two different “favorite_food”, but now I have to figure out how many students there are (which is one student))
NAME, FAVORITE_FOOD_TYPE,FRUIT
Charlie, Fruit, Apple
Charlie, Fruit, Pear
CASE B: Single Records (this captures the two different “favorite_food”, but is there a way to pick out from delimiters?)
NAME, FAVORITE_FOOD_TYPE,FRUITS
Charlie, Fruit, Apple#Pear
CASE C: Column for Each Fruit (this captures one record per student, but need a loooot of columns for each fruit, many would be false)
NAME, FAVORITE_FOOD_TYPE, APPLE, BANANA, PINEAPPLE, PEAR
Charlie, Fruit, TRUE, FALSE, FALSE, TRUE
I want to do this as easily as possible.
Avoid Case B if at all possible. Repeating information is almost always best handled by repeating rows -- not by cramming multiple values into a single table cell, nor by creating multiple columns such as Favorite_1 and Favorite_2.
If you are provided data with multiple values in a field, Tableau does have functions and data connection features that can be used to split a single field into its constituent parts to form multiple fields. That works well with fixed number of different kinds of information -- say splitting a City, State field into separate fields for City and State.
Avoid Case C if at all possible. That crosstab structure makes it hard to analyze the data and make useful visualizations. Each value is treated as a separate field.
If you are provided data in crosstab format, Tableau allows you to pivot the data in the data connection pane to reshape into a form with fewer columns and many rows.
Case A is usually the best approach. You can simplify it further by factoring out repeating information into separate tables -- a process known as normalization. Then you can use a join to recombine the tables and see the repeating information when desired.
A normalized approach to your example would have two tables (or tabs in Excel). The first table would have exactly one row per student with 2 columns: name and favorite_food_type. The second table would have a row per student/favorite food combination, with 2 columns: name and favorite_food. Now each student can have as many favorite foods as you like or none at all. Since both tables have a name field, that would be the key used to join (combine) the tables when needed.
Given that table design, you could have 2 data sources in Tableau. The first one just pointed to the student table and could be used to create visualizations that only involved students and favorite_food_types. The second data source would use a (left) join to read from both tables and could be used to look at favorite foods. When working with the second data source, you would have to be careful about reporting information about student names and favorite food types to account for the duplicate information. So use the first data source when possible. Finally, you could put both kinds of visualizations on a dashboard and use filter and highlight actions to make interaction seamless despite the two sources -- getting the best of both worlds.
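To illustrate the two-table design and why the joined data needs a distinct count, here is a small sketch in plain Python rather than Tableau. The students Dana and Eli, and the helper names left_join and count_distinct_students, are made up for the example.

```python
from collections import defaultdict

# Table 1: exactly one row per student.
students = [
    {"name": "Charlie", "favorite_food_type": "Fruit"},
    {"name": "Dana", "favorite_food_type": "Fruit"},      # hypothetical
    {"name": "Eli", "favorite_food_type": "Vegetable"},   # hypothetical, no foods
]

# Table 2: one row per student/favorite food combination.
favorite_foods = [
    {"name": "Charlie", "favorite_food": "Apple"},
    {"name": "Charlie", "favorite_food": "Pear"},
    {"name": "Dana", "favorite_food": "Apple"},
]


def left_join(students, favorite_foods):
    """Left join on name: students without foods keep one row with None."""
    foods_by_name = defaultdict(list)
    for row in favorite_foods:
        foods_by_name[row["name"]].append(row["favorite_food"])
    joined = []
    for student in students:
        for food in foods_by_name.get(student["name"]) or [None]:
            joined.append({**student, "favorite_food": food})
    return joined


def count_distinct_students(joined, selected_foods):
    """Count each student once, however many selected foods they like."""
    return len({r["name"] for r in joined if r["favorite_food"] in selected_foods})


joined = left_join(students, favorite_foods)
print(len(joined))                                          # rows after the join
print(count_distinct_students(joined, {"Apple", "Pear"}))   # distinct students
```

The join yields more rows than there are students, which is the duplication the answer warns about; counting distinct names rather than rows avoids double-counting Charlie.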

recover sort order/position values using magmi with multiple website/store/storeviews

I've been using Magmi with great success, creating and updating our magento products on a daily basis.
Our production retail site generally uses the default/admin values for store. When I make new categories and populate them I generally use the category_reset=0 column to preserve the handmade sort order or position values for all of the original categories.
I've been working on a wholesale site set up with a separate filesystem for all 3 levels of the Magento hierarchy. I did an import with Magmi, setting the store column to the wholesale site, with 2 additional columns, sku and category_ids (without category_reset), using a subset of data exported from the admin store view (I filtered the manufacturer column for only one manufacturer). The goal was to populate the wholesale site categories (same root catalog with certain categories disabled or not visible) with the same category products.
For some reason (ouch, I realize now there was a typo in the header name for store), it did not update the right store. It defaulted back to admin and lost the sort order for many categories; about 3k products imported OK.
I have 2 non-production sandbox sites with duplicate category data. I've been manually copying the category product listings with the desired position values into a new CSV, so I will have sku, category_id (singular), and position_value.
Many products are members of more than one category. My question is...
In order to regain the position values or sort order, what syntax should I use under category_ids? The products are already in the category so I would use a category_reset=0 column, right?
For an example record:
sku category_ids
45000 39,262,353
My next import might look like:
sku category_ids category_reset
abc 39::10 0
def 39::20 0
45000 39::30 0
ghi 262::10 0
45000 262::20 0
jkl 262::30 0
45000 353::10 0
mno 353::20 0
Does this seem workable? I'm feeling very gun-shy after having borked my production site with a typo and need some validation before I take steps to confuse myself further.
Thanks in advance for any insight.
As stated in the Magmi Documentation for Importing item positions in categories (from magmi version 0.7.18), the syntax is as follows:
sku,....,category_ids
000001,...,"8::1" <= put sku 000001 at position 1 in category with id 8
000002,...,"9::4,7" <= put sku 000002 at position 4 in category with id 9 and at position 0 in category with id 7
000003,...,"8::10" <= put sku 000003 at position 10 in category with id 8
So yes, your method should work. Be sure to do a full database backup before doing major import changes ;)
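If you already have the sku, category_id, position_value triples in a CSV, a small script can build the category_ids column for you. This is only a sketch in Python: it merges all categories of a SKU into one cell using the category::position syntax quoted above, and the function names build_magmi_rows and to_csv are made up. Whether one row per SKU is preferable to one row per SKU/category pair is an assumption you should check on a sandbox first.

```python
import csv
import io
from collections import defaultdict


def build_magmi_rows(triples):
    """Turn (sku, category_id, position) triples into one row per SKU."""
    by_sku = defaultdict(list)
    for sku, category_id, position in triples:
        by_sku[sku].append(f"{category_id}::{position}")
    return [
        {"sku": sku, "category_ids": ",".join(parts), "category_reset": "0"}
        for sku, parts in by_sku.items()
    ]


def to_csv(rows):
    """Serialize the rows; the csv module quotes cells that contain commas."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["sku", "category_ids", "category_reset"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


triples = [
    ("abc", 39, 10),
    ("def", 39, 20),
    ("45000", 39, 30),
    ("ghi", 262, 10),
    ("45000", 262, 20),
    ("jkl", 262, 30),
    ("45000", 353, 10),
    ("mno", 353, 20),
]
print(to_csv(build_magmi_rows(triples)))
```

Generating the file from the triples also avoids hand-typed headers, which is where the earlier store typo crept in.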

Aggregation on Hierarchy Data types

I've been reading about the hierarchyid data type in MS SQL Server 2012. I'm trying to store an organisation data structure with values at each level. I'm wondering how you aggregate data that is associated with a hierarchy column.
For instance, say I want to sum up everything to 3 levels from the top of the hierarchy: what would I use to do that? Would I use a GROUP BY or a ROLLUP, or is there some new function I could use on the hierarchy data type?
To sum up everything up to 3 levels, it looks like you could use GetLevel as per these links:
http://technet.microsoft.com/en-us/3b4f7dae-65b5-4d8d-8641-87aba9aa692d
http://msdn.microsoft.com/en-au/library/bb677197(v=sql.100).aspx
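-- z stands for the numeric column you want to aggregate.
-- GetLevel() returns 0 for the root, so levels 0 to 2 are the top three levels.
SELECT SUM(z)
FROM HumanResources.EmployeeOrg
WHERE OrgNode.GetLevel() BETWEEN 0 AND 2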
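To see what the level filter does, and how it differs from rolling values up to an ancestor, here is a rough sketch in Python rather than T-SQL. It models nodes by path strings such as "/1/2/" with the root as "/"; the sample values and the helper names level, ancestor_at, sum_top_levels and rollup are all made up for illustration.

```python
from collections import defaultdict


def level(path):
    """Depth of a node: "/" is 0, "/1/" is 1, "/1/2/" is 2."""
    return len([part for part in path.split("/") if part])


def ancestor_at(path, lvl):
    """Path of the ancestor of `path` that sits at depth `lvl`."""
    parts = [part for part in path.split("/") if part][:lvl]
    return "/" + "".join(part + "/" for part in parts)


def sum_top_levels(nodes, max_level):
    """Same idea as the GetLevel() filter: add up nodes at depth <= max_level."""
    return sum(v for path, v in nodes.items() if level(path) <= max_level)


def rollup(nodes, lvl):
    """Subtotal per node at depth `lvl`, including all of its descendants."""
    totals = defaultdict(int)
    for path, value in nodes.items():
        if level(path) >= lvl:
            totals[ancestor_at(path, lvl)] += value
    return dict(totals)


# Hypothetical organisation values keyed by hierarchy path.
nodes = {"/": 1, "/1/": 10, "/2/": 20, "/1/1/": 5, "/1/1/1/": 7, "/2/1/": 3}

print(sum_top_levels(nodes, 2))
print(rollup(nodes, 1))
```

The first function ignores anything deeper than the cut-off, which is what the query above does; the second keeps deeper values by crediting them to their ancestor, which may be closer to what "sum up everything to 3 levels" is meant to achieve.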