Representing tree structure in kdb

How would I represent the below tree structure with the values in each node in kdb?
a : 4
b : 3
c : 1
d : 2
e : 7
f : 5
g : 2
I would also need to set up a function to sum the values at the nodes.
Any tips appreciated.

There are several approaches you can try.
For example, a TreeTable: a table with parent and child columns.
A treetable is a table with additional properties.
Firstly, the records of the table are related hierarchically. Thus, a record may have one or more child-records, which may in turn have children. If a record has a parent, it has exactly one. A record without a parent is called a root record. A record without any children is called a leaf record. A record with children is called a node record.
The following paper explains TreeTable: http://archive.vector.org.uk/art10500340
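To make the parent/child idea concrete, here is a minimal sketch in Python rather than q. It is not taken from the paper. The tree shape (b and e under a, c and d under b, f under e, g under f) is an assumption, since the question only lists the node values, and `rows` and `subtree_sum` are illustrative names.

```python
# Hypothetical parent/child table for the tree in the question.
# The shape of the tree is an assumption; only the values come from the question.
rows = [
    # (node, parent, value)
    ("a", None, 4),
    ("b", "a", 3),
    ("c", "b", 1),
    ("d", "b", 2),
    ("e", "a", 7),
    ("f", "e", 5),
    ("g", "f", 2),
]

# Index values by node and children by parent so lookups are dictionary accesses.
values = {}
children = {}
for node, parent, value in rows:
    values[node] = value
    children.setdefault(parent, []).append(node)

def subtree_sum(node):
    """Sum the value at `node` and at every descendant of `node`."""
    return values[node] + sum(subtree_sum(c) for c in children.get(node, []))

print(subtree_sum("a"))  # 24
print(subtree_sum("e"))  # 14
```

A root record is one whose parent is `None`, and a leaf record is one that never appears as a key in `children`.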

A nested-dictionary approach would be:
q)dict.a.b.c:1
q)dict.a.b.d:2
q)dict.a.b[`]:3
q)dict.a[`]:4
q)dict.a.e.f.g:2
q)dict.a.e.f[`]:5
q)dict.a.e[`]:7
/note that it is important that each node is defined starting at the deepest nodes and working backwards to the top level nodes.
/to see the hierarchy use dict.a or dict.a.e etc, or more dynamically use
q)#/[dict;`a]
| 4
b| ``c`d!3 1 2
e| ``f!(7;``g!5 2)
/to get the values at individual nodes
q)first #/[dict;`a`e`f]
5
q)first #/[dict;`a`e`f`g]
2
/to find all values under a node
q){raze $[99h=type x;.z.s each x;x]}dict.a
4 3 1 2 7 5 2
q){raze $[99h=type x;.z.s each x;x]}#/[dict;`a`e]
7 5 2
/to sum all values under a node
q)sum {raze $[99h=type x;.z.s each x;x]}dict.a
24
q)sum {raze $[99h=type x;.z.s each x;x]}#/[dict;`a`e]
14
Obviously these can all be wrapped up into nice neat functions if necessary.

IMO it depends on how you plan to "query" the tree and what you want out of it. Can you elaborate further? Do you want sub-trees? E.g. if you query for "a", should it give you "4" or the whole sub-tree (in this case, the whole tree)?
If memory isn't an issue, or your tree is small, you could have either a nested dictionary of dictionaries with a function that recurses through them, or a single dictionary keyed by the path of letter symbols:
q)d:(enlist enlist `a)!enlist 4
q)d,:(enlist `a`b)!enlist 3
q)d,:(enlist `a`b`c)!enlist 1
q)d
,`a | 4
`a`b | 3
`a`b`c| 1
q)d`a
4
q)d`a`b
3
q)d`a`b`c
1
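The path-keyed dictionary also makes subtree sums straightforward: sum every entry whose key starts with the node's path. The sketch below shows the idea in Python rather than q. The paths beyond a, a/b and a/b/c are an assumption about the tree's shape, and `subtree_sum` is an illustrative name.

```python
# Path-keyed dictionary: each key is the full path from the root to a node.
# Paths other than a, a/b and a/b/c are assumed, not given in the post.
tree = {
    ("a",): 4,
    ("a", "b"): 3,
    ("a", "b", "c"): 1,
    ("a", "b", "d"): 2,
    ("a", "e"): 7,
    ("a", "e", "f"): 5,
    ("a", "e", "f", "g"): 2,
}

def subtree_sum(tree, prefix):
    """Sum every value whose path starts with `prefix` (the node itself included)."""
    n = len(prefix)
    return sum(value for path, value in tree.items() if path[:n] == prefix)

print(tree[("a", "b")])                # 3
print(subtree_sum(tree, ("a",)))       # 24
print(subtree_sum(tree, ("a", "e")))   # 14
```

The trade-off is that every subtree query scans all keys, which is fine for a small tree but not for a large one.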

Related

Select cases if value is greater than mean of group

Is there a way to include means of entire variables in Select Cases If syntax?
I have a dataset with three groups n=20 each (sorting variable grp with values 1, 2, or 3) and results of a pre and post evaluation (variable pre and post). I want to select for every group only the 10 cases where the pre value is higher than the mean of that value in the group.
In pseudocode:
select if pre-value > mean(grp)
So if the mean in group 1 is 15, that's what all values from group one cases should be compared to. But at the same time if group 2's mean is 20, that is what values from cases in group 2 should be compared to.
Right now I only see the MEAN(arg1,arg2,...) function in the Select Cases If window, but no possibility to get the mean of an entire variable, much less with an additional condition (like group).
Is there a way to do this with Select Cases If syntax, or otherwise?
You need to create a new variable that contains the group mean, so every row in a group has the same value in this variable. You can then compare each row to this value.
First I'll create some example data to demonstrate on:
data list list/grp pre_value .
begin data
1 3
1 6
1 8
2 1
2 4
2 9
3 55
3 43
3 76
end data.
Now you can calculate the group mean and select:
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=grp /GrpMean=MEAN(pre_value).
select if pre_value > GrpMean.
.
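For readers who want to check the logic outside SPSS, the same two steps (attach the group mean, then filter) can be sketched in Python. This is only an illustration of the technique on the example data above, not SPSS syntax; the variable names are made up.

```python
from collections import defaultdict

# Same example data as above: (grp, pre_value) pairs.
cases = [(1, 3), (1, 6), (1, 8), (2, 1), (2, 4), (2, 9), (3, 55), (3, 43), (3, 76)]

# Step 1: compute each group's mean (the role AGGREGATE plays).
by_group = defaultdict(list)
for grp, pre in cases:
    by_group[grp].append(pre)
group_mean = {grp: sum(vals) / len(vals) for grp, vals in by_group.items()}

# Step 2: keep only cases above their own group's mean (the role of SELECT IF).
selected = [(grp, pre) for grp, pre in cases if pre > group_mean[grp]]
print(selected)  # [(1, 6), (1, 8), (2, 9), (3, 76)]
```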

(Almost) adjacency postgres table that has mixed up relations

I have an (almost) adjacency list postgres table that uses relationships like so:
Id.A Id.B
1 4
2 3
1 5
5 6
7 8
3 4
5 7
Giving:
      1
     / \
    5   4
   / \   \
  6   7   3
       \   \
        8   2
And I wish to find the node with the smallest value. It isn't necessarily the root, because that doesn't really exist. I need to somehow traverse the graph from any point and find the node with the smallest value. The problem is that, as things currently are, it's not an adjacency table per se.
I am fairly sure this has to use a recursive CTE, but I don't quite know where to start.
And, as I think I may not have been too clear here, there are a bunch of these graphs all muddled up within the same table, not just one single graph, so for each row it would need to traverse its own graph to find its own smallest value node.
Any help would be much appreciated, thanks.
The root nodes are the only nodes not showing up in id.b, so let's start from there recursively:
WITH RECURSIVE roots(root, child) AS (
    SELECT a, a FROM id
    EXCEPT
    SELECT b, b FROM id
    UNION ALL
    SELECT r.root, id.b
    FROM roots r
    JOIN id ON r.child = id.a
) SELECT root, child FROM roots;
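Note that with the sample data both 1 and 2 never appear in Id.B, so a root-based walk yields two roots for what the question draws as a single graph. A different technique, shown here only as a sketch in Python and not as a replacement for the SQL above, is to treat the rows as undirected edges and label each connected component with its smallest node using union-find. The function names are illustrative.

```python
# Edges from the question's table (Id.A, Id.B), treated as undirected.
edges = [(1, 4), (2, 3), (1, 5), (5, 6), (7, 8), (3, 4), (5, 7)]

parent = {}

def find(x):
    """Return the representative of x's component, using path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge two components, keeping the smaller id as the representative."""
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

for a, b in edges:
    union(a, b)

# Because the smaller id always wins a merge, each representative is the
# minimum node of its component.
smallest = {node: find(node) for node in list(parent)}
print(smallest)
```

With several unrelated graphs in the same table, each node simply maps to the minimum of its own component.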

How to pick items from warehouse to minimise travel in TSQL?

I am looking at this problem from a TSQL point of view, however any advice would be appreciated.
Scenario
I have 2 sets of criteria which identify items in a warehouse to be selected.
Query 1 returns 100 items
Query 2 returns 100 items
I need to pick any 25 of the 100 items returned in query 1.
I need to pick any 25 of the 100 items returned in query 2.
- The items in query 1/2 will not be the same, ever.
Each item is stored in a segment of the warehouse.
A segment of the warehouse may contain numerous items.
I wish to select the 50 items (25 from each query) in a way as to reduce the number of segments I must visit to select the items.
Suggested Approach
My initial idea has been to combine the 2 result sets and produce a list of
Segment ID, NumberOfItemsRequiredInSegment
I would then select 25 items from each query, giving preference to those in the segments with the highest NumberOfItemsRequiredInSegment. I know this would not be optimal, but it would be an easy-to-implement heuristic.
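The heuristic described above can be sketched as follows. This is Python rather than TSQL, with made-up item and segment ids and a small `need` in place of 25, purely to show the ranking step; it makes no claim to be optimal.

```python
from collections import Counter

def pick(query1, query2, need):
    """Greedy sketch: prefer segments holding the most candidate items overall.

    query1 and query2 are lists of (item_id, segment_id) pairs;
    `need` items are taken from each list.
    """
    # Combined candidate count per segment across both result sets.
    load = Counter(seg for _, seg in query1 + query2)

    def take(items):
        # Busiest segments first; segment id, then item id, break ties.
        ranked = sorted(items, key=lambda it: (-load[it[1]], it[1], it[0]))
        return ranked[:need]

    picked1, picked2 = take(query1), take(query2)
    segments = {seg for _, seg in picked1 + picked2}
    return picked1, picked2, segments

# Hypothetical data: 4 candidate items per query, 2 needed from each.
q1 = [("i1", "A"), ("i2", "A"), ("i3", "B"), ("i4", "C")]
q2 = [("j1", "A"), ("j2", "B"), ("j3", "D"), ("j4", "D")]
picked1, picked2, segments = pick(q1, q2, need=2)
print(picked1, picked2, sorted(segments))
```

In this example both picks are served from segments A and B, so two segments are visited.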
Questions
1) I suspect this is a standard combinatorial problem, but I don't recognise it. Perhaps multiple knapsack? Does anyone recognise it?
2) Is there a better (easy-ish to implement) heuristic or solution, ideally in TSQL?
Many thanks.
This might also not be optimal, but I think it would at least perform fairly well.
Calculate this set for query 1:
Segment ID, NumberOfItemsRequiredInSegment
Take the top 25, just by sorting by NumberOfItemsRequiredInSegment. Call this subset A.
Take the top 25 from query 2 by joining to A and sorting by "case when A.segmentID is not null then 1 else 0 end, NumberOfItemsRequiredInSegmentFromQuery2".
Repeat this, but take the top 25 from query 2 first. Return the better performing of the 2 sets.
The one scenario where I think this fails would be if you got something like this:
Segment  Count Query 1  Count Query 2
A        10             1
B        5              1
C        5              1
D        5              4
E        5              4
F        4              4
G        4              5
H        1              5
J        1              5
K        1              10
You need to make sure you choose A, D, and E when choosing the best segments from query 1. To deal with this, you'd still need to join to query 2 so you can use its count as a tie breaker.

Calculating change in leaders for baseball stats in MSSQL

Imagine I have a MSSQL 2005 table(bbstats) that updates weekly showing
various cumulative categories of baseball accomplishments for a team
week 1
Player  H   SO  HR
Sammy   7   11  2
Ted     14  3   0
Arthur  2   15  0
Zach    9   14  3
week 2
Player  H   SO  HR
Sammy   12  16  4
Ted     21  7   1
Arthur  3   18  0
Zach    12  18  3
I wish to highlight textually where there has been a change in leader for each category,
so after week 2 there would be nothing to report on hits (H); Zach has joined Arthur with the most strikeouts (SO) at 18; and Sammy is the new leader in home runs (HR) with 4.
So I would want to set up a process something like
a) save the past data (week 1) as table bbstatsPrior,
b) update bbstats with the new results - I do not need assistance with this
c) compare between the tables the player (or players, with ties) with the max value for each column, and output only where they differ
d) move on to the next column and repeat
In any real world example there would be significantly more columns to calculate for
Thanks
Responding to Brent's comments, I am really after any changes in the leaders for each category.
So I would have something like
select top 1 with ties player
from bbstatsPrior
order by H desc
and
select top 1 with ties player,H
from bbstats
order by H desc
I then want to compare the player from each query (do I need temp tables?). If they differ, I want to output the second select statement. For the H category Ted is the leader in both tables, but for other categories there are changes between the weeks.
I can then loop through the columns using
select name from sys.all_columns sc
where sc.object_id=object_id('bbstats') and name <>'player'
If the number of stats doesn't change often, you could easily just write a single query to get this data. Join bbStats to bbStatsPrior where bbstatsprior.week < bbstats.week and bbstats.week=#weekNumber. Then just do a simple comparison of bbstats.Hits with bbStatsPrior.Hits to get your difference.
If the stats change often, you could use dynamic SQL to do this for all columns that match a certain pattern, or that are in a list of columns based on sys.columns for that table.
You could add a column for each stat column to designate the leader using a correlated subquery to find the max value for that column and see if it's equal to the current record.
This might get you started, but I'd recommend posting what you currently have to achieve this and the community can help you from there.
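As a language-neutral illustration of the comparison being asked for (steps c and d in the question), here is a small Python sketch using the week 1 and week 2 figures. It is not T-SQL, and `leaders` and `leader_changes` are made-up names; the point is only to show the "leaders with ties, per column, report if changed" logic.

```python
# Week 1 (prior) and week 2 (current) tables from the question: player -> stats.
prior = {
    "Sammy": {"H": 7, "SO": 11, "HR": 2},
    "Ted": {"H": 14, "SO": 3, "HR": 0},
    "Arthur": {"H": 2, "SO": 15, "HR": 0},
    "Zach": {"H": 9, "SO": 14, "HR": 3},
}
current = {
    "Sammy": {"H": 12, "SO": 16, "HR": 4},
    "Ted": {"H": 21, "SO": 7, "HR": 1},
    "Arthur": {"H": 3, "SO": 18, "HR": 0},
    "Zach": {"H": 12, "SO": 18, "HR": 3},
}

def leaders(table, stat):
    """Return (set of players tied for the maximum, the maximum value)."""
    best = max(row[stat] for row in table.values())
    return {p for p, row in table.items() if row[stat] == best}, best

def leader_changes(prior, current, stats):
    """Map each stat whose leader set changed to its new leaders and value."""
    changes = {}
    for stat in stats:
        old, _ = leaders(prior, stat)
        new, value = leaders(current, stat)
        if old != new:
            changes[stat] = (sorted(new), value)
    return changes

changes = leader_changes(prior, current, ["H", "SO", "HR"])
print(changes)  # {'SO': (['Arthur', 'Zach'], 18), 'HR': (['Sammy'], 4)}
```

Hits do not appear in the result because Ted leads in both weeks.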

Aggregate path counts using HierarchyID

Business problem - understand process fallout using analytics data.
Here is what we have done so far:
Build a dictionary table with every possible process step
Find each process "start"
Find the last step for each start
Join dictionary table to last step to find path to final step
In the final report output we end up with a list of paths for each start to each final step:
User  Fallout Step HierarchyID.ToString()
A     1/1/1
B     1/1/1/1/1
C     1/1/1/1
D     1/1/1
E     1/1
What this means is that five users (A-E) started the process. Assume only User B finished; the other four did not. Since this is a simple example (without branching), we want the output to look as follows:
Step  Unique Users
1     5
2     5
3     4
4     2
5     1
The easiest solution I could think of is to take each hierarchyID.ToString(), parse that out into a set of subpaths, JOIN back to the dictionary table, and output using GROUP BY.
Given the volume of data, I'd like to use the built-in HierarchyID functions, e.g. IsAncestorOf.
Any ideas or thoughts how I could write this? Maybe a recursive CTE?
Restructuring the data may help with this. For example, structuring the data like this:
User  Step  Process#
----  ----  --------
A     1     1
A     2     1
A     3     1
B     1     2
B     2     2
B     3     2
B     4     2
B     5     2
E     1     3
E     2     3
E     1     4
E     2     4
E     3     4
Allows you to run the following query:
-- [user] is bracketed because USER is a reserved word in SQL Server
select step,
       count(distinct process#) as process_iterations,
       count(distinct [user]) as unique_users
from stepdata
group by step
order by step;
which returns:
Step  Process_Iterations  Unique_Users
----  ------------------  ------------
1     4                   3
2     4                   3
3     3                   3
4     1                   1
5     1                   1
I'm not familiar with hierarchyid, but splitting out that data into chunks for analysis looks like the sort of problem numbers tables are very good for. Join a numbers table against the individual substrings in the fallout and it shouldn't be too hard to treat the whole thing as a table and analyse it on the fly, without any non-set operations.
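The split-and-count idea can be checked outside the database. The sketch below is Python, not T-SQL, and does not use any HierarchyID function: it splits each path string from the question into its parts and counts every user once for each depth the path reaches. It assumes the simple no-branching case from the question, so only the depth matters; `users_per_step` is an illustrative name.

```python
from collections import Counter

# Fallout paths from the question's report: user -> HierarchyID.ToString() value.
paths = {"A": "1/1/1", "B": "1/1/1/1/1", "C": "1/1/1/1", "D": "1/1/1", "E": "1/1"}

def users_per_step(paths):
    """Count, for each step depth, the users whose path reaches at least that depth."""
    counts = Counter()
    for path in paths.values():
        # Ignore empty parts so leading or trailing slashes do not add a level.
        depth = len([part for part in path.split("/") if part])
        for step in range(1, depth + 1):
            counts[step] += 1
    return dict(sorted(counts.items()))

print(users_per_step(paths))  # {1: 5, 2: 5, 3: 4, 4: 2, 5: 1}
```

This reproduces the expected output table in the question. With branching, the key would need to be the path prefix rather than just the depth.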