Using GraphFrames (Scala) to compute hierarchy

I have the DataFrame below:

employee_id | employee_name | manager_employee_id
------------+---------------+--------------------
          1 | eric (ceo)    |                   1
          2 | edward        |                   1
          3 | john          |                   1
          4 | james         |                   2
          5 | ella          |                   4
I would like to use Spark (Scala) GraphFrames logic to achieve the following:

employee_id | employee_name | manager_employee_id | level | hierarchy
------------+---------------+---------------------+-------+----------
          1 | eric          |                   1 |     0 | /1
          2 | edward        |                   1 |     1 | /1/2
          3 | john          |                   1 |     1 | /1/3
          4 | james         |                   2 |     2 | /1/2/4
          5 | ella          |                   4 |     3 | /1/2/4/5

Any help would be much appreciated.

I think the answer you are looking for is more or less related to this.
The only modification required on your end will be aggregating the messages to get the complete hierarchy.
For that part you can refer to this.
A combination of the above two will get you the desired result.
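If it helps, here is a minimal plain-Spark (Scala) sketch that produces the level and hierarchy columns for the sample data by iteratively joining each employee to the already-resolved rows, rather than via GraphFrames message passing. Everything beyond the question's column names (the object name, the loop structure) is an assumption, not the canonical GraphFrames solution:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EmployeeHierarchy extends App {
  val spark = SparkSession.builder().master("local[*]").appName("hierarchy").getOrCreate()
  import spark.implicits._

  val emp = Seq(
    (1, "eric", 1), (2, "edward", 1), (3, "john", 1), (4, "james", 2), (5, "ella", 4)
  ).toDF("employee_id", "employee_name", "manager_employee_id")

  // Seed with the root: the CEO is the only row that reports to itself.
  var resolved = emp.filter($"employee_id" === $"manager_employee_id")
    .withColumn("level", lit(0))
    .withColumn("hierarchy", concat(lit("/"), $"employee_id".cast("string")))

  // Repeatedly attach the direct reports of the rows resolved in the previous pass.
  var frontier = resolved
  var done = false
  while (!done) {
    val next = emp.as("e")
      .join(frontier.as("m"),
        $"e.manager_employee_id" === $"m.employee_id" &&
        $"e.employee_id" =!= $"m.employee_id")
      .select($"e.employee_id", $"e.employee_name", $"e.manager_employee_id",
        ($"m.level" + 1).as("level"),
        concat($"m.hierarchy", lit("/"), $"e.employee_id".cast("string")).as("hierarchy"))
    if (next.isEmpty) done = true
    else { resolved = resolved.union(next); frontier = next }
  }

  resolved.orderBy("employee_id").show(false)
}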


Group by specific column in PostgreSQL

I am trying to get a count of people depending on the org_id.
Let's say I have 10 people in my table, coming in from different organizations.
I want to count the number of people working in each organization, with each record still listed.
Postgres version 9.4
Below are my table records:
id | person_name    | emp_id | org_id
---+----------------+--------+-------
 1 | John Mackenzie | TTT104 | 1
 2 | Raven Raid     | TTT105 | 1
 3 | Albert Pinto   | TTT106 | 2
 4 | Albert Pinto1  | TTT119 | 2
 5 | Ram Raman      | TTT108 | 2
 6 | Huge Jackman   | TTT109 | 2
 7 | Peter Pan      | TTT107 | 2
 8 | Albert Pinto2  | TTT106 | 2
RESULT EXPECTED:
id | person_name    | emp_id | count(org_id)
---+----------------+--------+--------------
 1 | John Mackenzie | TTT104 | 2
 2 | Raven Raid     | TTT105 | 2
 3 | Albert Pinto   | TTT106 | 6
 4 | Albert Pinto1  | TTT119 | 6
 5 | Ram Raman      | TTT108 | 6
 6 | Huge Jackman   | TTT109 | 6
 7 | Peter Pan      | TTT107 | 6
 8 | Albert Pinto2  | TTT106 | 6
That is how I want the records to look in my Velocity template.
To start, the counts you are looking for can be collected as follows:
SELECT org_id, count(*)
FROM person
GROUP BY org_id;
This query collects the number of people working in each distinct org_id. Its result is:

org_id | count
-------+------
     1 |     2
     2 |     6
To keep every record listed while attaching the per-organization count, use the window-function version:

SELECT p.id,
       p.person_name,
       p.emp_id,
       count(p.org_id) OVER w AS org_count
FROM person p
WINDOW w AS (PARTITION BY p.org_id);
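Note that count(*) OVER w would give the same numbers here, since org_id is never null. Unlike the GROUP BY query, the window version leaves every row in place instead of collapsing the table to one row per org_id, which is what the expected result requires.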

SPSS - merging files with duplicate cases of ID variable and new cases/variables

I have an administrative dataset for store visits from multiple years that I am trying to merge into one under the ID variable.
Each dataset has duplicates of an ID that occur during different store visits, annotated by Date. Some of the more recent data files also have new variables (Y) not contained in the old data files. Datasets from different years will also contain different cases indicated by different ID. Also, some variables may be the same for each case but at different dates. I want the merged file to retain these duplicates.
Example data files:
File 1
ID Date X
1 3 4
1 5 3
2 1 4
File 2
ID Date X Y
1 6 4 2
1 7 1 5
2 8 4 7
3 7 2 3
I want the merged file to continue listing ALL duplicate cases, as such:
ID Date X Y
1 3 4 .
1 5 3 .
1 6 4 2
1 7 1 5
2 1 4 .
2 8 4 7
3 7 2 3
I then plan to restructure (CASESTOVARS /AUTOFIX=0) the merged file so that it looks like this:
ID Date.1 Date.2 Date.3 Date.4 X.1 X.2 X.3 X.4 Y.1 Y.2 Y.3 Y.4
1  3      5      6      7      4   3   4   1   .   .   2   5
2  1      8      .      .      4   4   .   .   .   7   .   .
3  7      .      .      .      2   .   .   .   3   .   .   .
I am having trouble with the initial merging process, however. I have tried looking up the safest way to merge files when they both have duplicate cases in order to make sure no data are lost in the process. It seems that the "Add Variables" method results in lost values for duplicate variables.
Thanks!
EDIT: If I used the "Add Variables" function and used both the ID and Date variables as the key variables, would that help avoid deletion of duplicate cases?
Why not try Add Cases instead of Add Variables? If there are no occurrences of the same ID with the same Date, it should work fine with the CASESTOVARS.
If there are such cases, you'll need to decide what to do with them before you can proceed with the CASESTOVARS.
One way would be to aggregate by ID and Date and decide whether you want to, for example, add up the data variables for such cases.
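A minimal syntax sketch of the Add Cases route, assuming the two yearly files are saved as file1.sav and file2.sav (hypothetical paths):

* Stack the two files (Add Cases); variables absent from one file
  become system-missing for its cases.
ADD FILES /FILE='file1.sav' /FILE='file2.sav'.
EXECUTE.
* CASESTOVARS requires the data to be sorted by the identifier variable.
SORT CASES BY ID Date.
CASESTOVARS /ID=ID /AUTOFIX=NO.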

"Inserting" Records into Fields from a Database Feed

The background: I'm trying to create a survival curve based on a database feed, following the directions here.
What I have so far is the three calculated fields below. Patient ID is not a calculated field, nor is it necessary for the survival analysis, but I believe it could be useful for this question. For reference, there are about 20,000 unique patients.
Patient ID | Time | Censor | Group
Id1        |    3 |      0 |     1
Id2        |    8 |      0 |     2
Id3        |    1 |      1 |     1
Id4        |    3 |      1 |     1
Id5        |   11 |      0 |     1
Id5        |    7 |      1 |     2
What I would like to do is insert two records (one for each group), like so:

Patient ID | Time | Censor | Group | Link
           |    0 |        |     1 |
           |    0 |        |     2 |
Id1        |    3 |      0 |     1 | link
Id2        |    8 |      0 |     2 | link
Id3        |    1 |      1 |     1 | link
Id4        |    3 |      1 |     1 | link
Id5        |   11 |      0 |     1 | link
Id5        |    7 |      1 |     2 | link
I tried, unsuccessfully, to create an Excel spreadsheet with these base attributes to union with the data, but an Excel spreadsheet does not appear to be able to union with a database source.
My next idea is to find 2 of the 20,000 patients where I can create a calculated field along these lines (not sure this is feasible in Tableau, please excuse my syntax):
IF [Patient ID] = Id3 THEN [TIME] = 0 AND [CENSOR] IS NULL
END
and then a [Link] calculated formula:
IF [Patient ID] = Id3 THEN NULL
ELSE "link"
END
Any help would be appreciated. Would like to avoid inserting these records in the database.
The best and easiest option is to use an outer join to your Excel workbook; cross-database joins are a new feature in Tableau version 10.
Then, once the dataset is combined, you can build business logic through a filter or calculated field based on the absence or presence of the Excel data.
http://www.tableau.com/about/blog/2016/7/integrate-your-data-cross-database-joins-56724
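To illustrate the calculated-field step, here is a hypothetical [Link] field; it assumes the Excel sheet contributes a column, called [Seed Group] here, that is non-null only for the two injected rows:

// Rows that came from the Excel sheet have a non-null [Seed Group]
// and get no link; every database row gets one.
IF ISNULL([Seed Group]) THEN 'link' END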

Please help me 'grok' triplestores

I'm an RDBMS person from way back. I'm trying to grok triplestores. I think my "confusion" may be addressed by the answer to the following question:
How is this...
Table1 (Subjects):
ID | Subject | Details
 1 | Barney  | …
 2 | Fred    | …
 3 | Picture | …
 4 | …

Table2 (Predicates):
ID | Predicate      | Details
 1 | friendOf       | …
 2 | marriedTo      | …
 3 | hasTimeStamp   | …
 4 | hasGeoCoord    | …
 5 | hasEventName   | …
 6 | belongsTo      | …
 7 | containsPerson | …
 8 | …

Table3 (Objects) - these may be Subjects as well:
ID | Object               | SubjectID | Details
 1 | Fred                 |         2 | …
 2 | Wilma                |      NULL | …
 3 | January 1, 2010 1530 |      NULL | …
 4 | 46°12′N              |      NULL | …
 5 | 6°09′E               |      NULL | …
 6 | Wedding              |      NULL | …
 7 | Ski Trip             |      NULL | …
 8 | Barney               |         1 | …
 9 | …

Table4 (Triples):
ID | SubjectID | PredicateID | ObjectID | Details
 1 |         1 |           1 |        2 | …
 2 |         2 |           2 |        3 | …
 3 |         3 |           3 |        3 | …
 4 |         3 |           4 |        4 | …
 5 |         3 |           4 |        5 | …
 6 |         3 |           5 |        6 | …
 7 |         3 |           5 |        7 | …
 8 |         3 |           7 |        8 | …
 9 |         3 |           7 |        2 | …
10 |         3 |           7 |        1 | …
11 | …
So row 10 in Table4 is: Picture containsPerson Fred.
...Is this not a triplestore?
If it is, then please comment on why this implementation (as an RDBMS) is inefficient, etc.
Thanks in advance!
It's possible, and to some degree easy, to implement a triple store on top of an RDBMS. There are several systems currently available that do this with varying degrees of success. However, they tend not to perform all that well, due to the transitive self-joins their design usually requires. This is why serious vendors who provide a triple store backed by a relational database, such as Oracle, have added customized handling to improve efficiency in these situations.
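To make the self-join point concrete, here is a sketch of how even a two-hop query pattern ends up as joins over the question's tables (the query shape is illustrative, not any particular store's actual SQL):

-- Pattern: ?x friendOf ?y . ?y marriedTo ?z
-- The object of the first triple must be resolved back to a subject
-- (via Table3.SubjectID) before the second triple can be matched, so
-- every additional pattern in a query adds further joins on Table4.
SELECT t1.SubjectID AS x, o.SubjectID AS y, t2.ObjectID AS z
FROM   Table4 t1
JOIN   Table3 o  ON o.ID = t1.ObjectID
JOIN   Table4 t2 ON t2.SubjectID = o.SubjectID
WHERE  t1.PredicateID = 1   -- friendOf
  AND  t2.PredicateID = 2;  -- marriedTo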
In my experience, native triple stores, those designed for the purpose of storing and querying RDF, always outperform solutions shoehorned on top of a relational system. So while they're very much databases and have a lot in common with a traditional RDBMS, there are still design choices in their implementation that make them better suited for answering SPARQL queries.
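For comparison, the same two-hop pattern expressed directly in SPARQL, which a native store evaluates against its triple indexes rather than via relational self-joins (the ex: prefix and property IRIs are made up for illustration):

PREFIX ex: <http://example.org/>
SELECT ?x ?y ?z
WHERE {
  ?x ex:friendOf  ?y .
  ?y ex:marriedTo ?z .
}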

Aggregate path counts using HierarchyID

Business problem - understand process fallout using analytics data.
Here is what we have done so far:
Build a dictionary table with every possible process step
Find each process "start"
Find the last step for each start
Join dictionary table to last step to find path to final step
In the final report output we end up with a list of paths for each start to each final step:
User Fallout Step (HierarchyID.ToString())
---- --------------------------------------
A    1/1/1
B    1/1/1/1/1
C    1/1/1/1
D    1/1/1
E    1/1
What this means is that five users (A-E) started the process. Assume only User B finished; the other four did not. Since this is a simple example (without branching), we want the output to look as follows:
Step Unique Users
1 5
2 5
3 4
4 2
5 1
The easiest solution I could think of is to take each hierarchyID.ToString(), parse that out into a set of subpaths, JOIN back to the dictionary table, and output using GROUP BY.
Given the volume of data, I'd like to use the built-in HierarchyID functions, e.g. IsAncestorOf.
Any ideas or thoughts how I could write this? Maybe a recursive CTE?
Restructuring the data may help with this. For example, structuring the data like this:
User Step Process#
---- ---- ---------
A    1    1
A    2    1
A    3    1
B    1    2
B    2    2
B    3    2
B    4    2
B    5    2
E    1    3
E    2    3
E    1    4
E    2    4
E    3    4
Allows you to run the following query:
select step,
       count(distinct process#) as process_iterations,
       count(distinct [user]) as unique_users
from stepdata
group by step
order by step;
which returns:
Step Process_Iterations Unique_Users
---- ------------------ ------------
1    4                  3
2    4                  3
3    3                  3
4    1                  1
5    1                  1
I'm not familiar with hierarchyid, but splitting that data out into chunks for analysis looks like the sort of problem numbers tables are very good at. Join a numbers table against the individual substrings in the fallout path, and it shouldn't be too hard to treat the whole thing as a table and analyse it on the fly, without any non-set operations; a sketch of that idea follows.
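A rough T-SQL sketch of the numbers-table approach. The table and column names are assumptions; fallout is taken to hold one row per user with the path string from the report above:

-- fallout([user] varchar(10), path varchar(400)), e.g. ('A', '1/1/1').
-- Each slash-delimited position in path is one process step, so the
-- number of segments in a user's path is the deepest step they reached.
WITH numbers AS (
    SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects
),
steps AS (
    -- a user reached step n if their path has at least n segments
    SELECT f.[user], n.n AS step
    FROM fallout AS f
    JOIN numbers AS n
      ON n.n <= LEN(f.path) - LEN(REPLACE(f.path, '/', '')) + 1
)
SELECT step, COUNT(DISTINCT [user]) AS unique_users
FROM steps
GROUP BY step
ORDER BY step;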