Aggregate path counts using HierarchyID - tsql

Business problem - understand process fallout using analytics data.
Here is what we have done so far:
1) Build a dictionary table with every possible process step
2) Find each process "start"
3) Find the last step for each start
4) Join dictionary table to last step to find path to final step
In the final report output we end up with a list of paths for each start to each final step:
User  Fallout Step HierarchyID.ToString()
----  -----------------------------------
A     1/1/1
B     1/1/1/1/1
C     1/1/1/1
D     1/1/1
E     1/1
What this means is that five users (A-E) started the process. Assume only User B finished; the other four did not. Since this is a simple example (without branching), we want the output to look as follows:
Step  Unique Users
----  ------------
1     5
2     5
3     4
4     2
5     1
The easiest solution I could think of is to take each hierarchyID.ToString(), parse that out into a set of subpaths, JOIN back to the dictionary table, and output using GROUP BY.
Given the volume of data, I'd like to use the built-in HierarchyID functions, e.g. IsAncestorOf.
Any ideas or thoughts how I could write this? Maybe a recursive CTE?

Restructuring the data may help with this. For example, structuring the data like this:
User Step Process#
---- ---- --------
A 1 1
A 2 1
A 3 1
B 1 2
B 2 2
B 3 2
B 4 2
B 5 2
E 1 3
E 2 3
E 1 4
E 2 4
E 3 4
This allows you to run the following query:
-- USER is a reserved keyword in T-SQL, so the column name is bracketed
select step,
       count(distinct process#) as process_iterations,
       count(distinct [user]) as unique_users
from stepdata
group by step
order by step;
which returns:
Step Process_Iterations Unique_Users
---- ------------------ ------------
1 4 3
2 4 3
3 3 3
4 1 1
5 1 1

I'm not familiar with hierarchyid, but splitting out that data into chunks for analysis looks like the sort of problem numbers tables are very good for. Join a numbers table against the individual substrings in the fallout and it shouldn't be too hard to treat the whole thing as a table and analyse it on the fly, without any non-set operations.
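The numbers-table idea can be sketched outside the database to check the logic. The snippet below is a minimal illustration in Python, not T-SQL: it assumes one path per user (the non-branching case from the question), and the helper name `users_per_step` is made up for this example. The `range` over each path's depth plays the role of the join against the numbers table.

```python
from collections import Counter

# Sample data from the question: user -> path of the last step reached.
paths = {"A": "1/1/1", "B": "1/1/1/1/1", "C": "1/1/1/1", "D": "1/1/1", "E": "1/1"}

def users_per_step(paths):
    """Count how many users reached each step depth (hypothetical helper)."""
    counts = Counter()
    for user, path in paths.items():
        # Depth of the path = number of non-empty segments.
        depth = len([segment for segment in path.split("/") if segment])
        # A user who reached depth N also passed through steps 1..N-1.
        for step in range(1, depth + 1):
            counts[step] += 1
    return dict(sorted(counts.items()))

print(users_per_step(paths))  # {1: 5, 2: 5, 3: 4, 4: 2, 5: 1}
```

With branching paths the depth alone is no longer enough, and each prefix of the path would have to be counted instead.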

Related

Get consecutive sequence number in ireport

I need to display row number sequence of each group.
I have used $V{PAGE_COUNT} with the evaluation time set to Now.
The report data that I am getting is:
Group A
1
2
3
4
--- page ends ---
Group A
1
2
3
4
--- page ends ---
Group B
1
2
3
4
5
--- page ends ---
But my requirement is:
Group A
1
2
3
4
--- page ends ---
Group A
5
6
7
8
9
--- page ends ---
Group B
1
2
3
4
5
--- page ends ---
I need all rows of the same group to form one continuous sequence, and the sequence should restart from 1 when the group changes.
You should use the GroupName_COUNT variable in this case.
A quote from the JasperReports Ultimate Guide:
When declaring a report group, the engine automatically creates a count variable that calculates the number of records that make up the current group (that is, the number of records processed between group ruptures).
The name of this variable is derived from the name of the group it corresponds to, suffixed with the _COUNT sequence. It can be used like any other report variable, in any report expression (even in the current group expression, as shown in the BreakGroup group of the /demo/samples/jasper sample).
More info is here: Data Grouping
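To make the difference between a per-page counter and a per-group counter concrete, here is a small simulation in Python. It is only an illustration of the counting behaviour described in the quote, not JasperReports code; the function name `number_rows` and the page layout are assumptions taken from the question's desired output.

```python
def number_rows(pages):
    """Number rows per group; the counter survives page breaks (illustrative only)."""
    numbered = []
    current_group, count = None, 0
    for page in pages:
        numbered_page = []
        for group in page:
            if group != current_group:
                # Group rupture: restart the sequence at 1.
                current_group, count = group, 0
            count += 1
            numbered_page.append((group, count))
        numbered.append(numbered_page)
    return numbered

# Three pages: Group A spans the first two, Group B fills the third.
pages = [["A"] * 4, ["A"] * 5, ["B"] * 5]
for page in number_rows(pages):
    print(page)
```

The second page starts at ("A", 5) rather than ("A", 1), because only a change of group resets the counter.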

Is it possible to return only one instance of each ID in a view?

I am trying to work out how I would ensure I only get one instance of each user and their ID when I try to do an inner join on my source table.
The source data is a series of user names and IDs
userid username
1 alice
2 bob
3 charley
4 dave
5 robin
6 jon
7 lou
8 scott
I have had to write the algorithm in Python to make sure each user's data is matched with only one other user's (so that each user's data is used once in each round).
We store the pairings, and how many rounds have been completed successfully after the tests, but I'd like to try and shortcut the process.
I can get all the results, but I want to find a better way to remove each matched pair from the results so they can't be matched again.
select u.user_id, u.user, ur.user_id, ur.user
from userResults u inner join userResults ur
on u.user_id < ur.user_id
and (u.user_id, ur.user_id) not in
(select uid, uid2 from rounds)
where u.match <= ur.match and ((u.user_id) not in %s
and ur.user_id not in %s) limit 1;
I've tried making materialised views with a unique constraint, but it doesn't seem to affect it - I get each possible pairing once, rather than each user paired only once.
I'm trying to work out how to get only 4 results, in the right order.
Every time I look at the underlying code, I can't help but think there's a better way to write it natively in SQL rather than having to iterate over results in Python.
edit
Assuming each user has been matched 0 or more times, you might have a situation where user_ids 1-4 have rounds set to 1 and matches set to 1, and the remaining 4 have rounds set to 1 and no matches.
I have a view which will return a default value of 0 and 0 for rounds and matches if they haven't yet played, and you can't assume all rounds entered have met with a match.
If the first 4 have all matched, and have generated rounds, user 1 and user 2 have already met and matched in a round, so they won't be matched again, so user 1 will match user 3 (or 4) and user 2 will match user 4 (or 3).
The issue I'm having is that when I remove limit, and iterate through manually - the first three matches I always get are: 2,4 then 1,3, then 2,3 (rather than 5,7 or 6,8)
Adding the sample data and current rounds
table rounds
uid uid2
1 2
3 4
userresults view
user_id user rounds score
1 alice 1 0
2 bob 1 1
3 charley 1 1
4 dave 1 0
5 robin 0 0
6 jon 0 0
7 lou 0 0
8 scott 0 0
I'm currently getting results like:
2,4
2,3
1,3
1,4
4,6
...
These are all valid results, but I would like to limit them to a single instance of each ID in each column, so just the first match of each valid pairing.
I've created a new view to try and simplify things a bit, populated it with dummy data, and tried to generate matches.
All these matches are valid, and I'm trying to add some form of filter or restriction to bring it back to sensible numbers.
777;"Bongo Steveson";779;"Drongo Mongoson"
777;"Bongo Steveson";782;"Cheesey McHamburger"
777;"Bongo Steveson";780;"Buns Mattburger"
779;"Drongo Mongoson";782;"Cheesey McHamburger"
779;"Drongo Mongoson";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";777;"Bongo Steveson"
778;"Mongo Bongoson";779;"Drongo Mongoson"
775;"Bob Jobsworth";778;"Mongo Bongoson"
778;"Mongo Bongoson";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";782;"Cheesey McHamburger"
775;"Bob Jobsworth";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";780;"Buns Mattburger"
776;"Steve Bobson";777;"Bongo Steveson"
776;"Steve Bobson";779;"Drongo Mongoson"
776;"Steve Bobson";782;"Cheesey McHamburger"
776;"Steve Bobson";778;"Mongo Bongoson"
776;"Steve Bobson";781;"Hamburgler Bunburger"
780;"Buns Mattburger";782;"Cheesey McHamburger"
780;"Buns Mattburger";781;"Hamburgler Bunburger"
I still can't work out a sensible way to restrict these values, and it's driving me nuts.
I've implemented a solution in code but I'd really like to see if I can get this working in native Postgres.
At this point I'm monkeying around with a new test database schema, and this is my view - adding unique to the index generates an error, and I can't add a check constraint to a materialised view (grrrr)
You can try joining to a subquery to ensure distinct records from the user table.
select * from any_table t1
inner join(
select distinct userid,username from source_table
) t2 on t1.userid=t2.userid;
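For reference, the "each user appears in at most one pair per round" rule that the question describes can be sketched as a greedy pass over the candidate pairs. This is a minimal Python illustration, not a Postgres solution: the function name `pair_users` is hypothetical, it ignores the score/match ordering from the question, and a greedy pass is not guaranteed to find the largest possible set of pairs in every case.

```python
from itertools import combinations

def pair_users(user_ids, previous_rounds):
    """Greedily pair users, skipping pairs that already met (illustrative only)."""
    played = {frozenset(pair) for pair in previous_rounds}
    used = set()
    pairs = []
    for a, b in combinations(sorted(user_ids), 2):
        if a in used or b in used:
            continue  # each user may appear in only one pair per round
        if frozenset((a, b)) in played:
            continue  # this pairing was already used in an earlier round
        pairs.append((a, b))
        used.update((a, b))
    return pairs

# Sample data from the question: users 1-8, rounds table holds (1,2) and (3,4).
print(pair_users(range(1, 9), [(1, 2), (3, 4)]))
# [(1, 3), (2, 4), (5, 6), (7, 8)]
```

This yields four pairs for the sample data, which matches the number of results the question is after.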

Representing tree structure in kdb

How would I represent the below tree structure with the values in each node in kdb?
a : 4
  b : 3
    c : 1
    d : 2
  e : 7
    f : 5
      g : 2
I would need to set up a function to sum the values at the nodes too.
Any tips appreciated.
There are different approaches you can try.
For example, a TreeTable: a table with parent and child columns.
A treetable is a table with additional properties.
Firstly, the records of the table are related hierarchically. Thus, a record may have one or more child-records, which may in turn have children. If a record has a parent, it has exactly one. A record without a parent is called a root record. A record without any children is called a leaf record. A record with children is called a node record.
The following paper explains TreeTable: http://archive.vector.org.uk/art10500340
A nested-dictionary approach would be:
q)dict.a.b.c:1
q)dict.a.b.d:2
q)dict.a.b[`]:3
q)dict.a[`]:4
q)dict.a.e.f.g:2
q)dict.a.e.f[`]:5
q)dict.a.e[`]:7
/note that it is important that each node is defined starting at the deepest nodes and working backwards to the top level nodes.
/to see the hierarchy use dict.a or dict.a.e etc, or more dynamically use
q)@/[dict;`a]
 | 4
b| ``c`d!3 1 2
e| ``f!(7;``g!5 2)
/to get the values at individual nodes
q)first @/[dict;`a`e`f]
5
q)first @/[dict;`a`e`f`g]
2
/to find all values under a node
q){raze $[99h=type x;.z.s each x;x]}dict.a
4 3 1 2 7 5 2
q){raze $[99h=type x;.z.s each x;x]}@/[dict;`a`e]
7 5 2
/to sum all values under a node
q)sum {raze $[99h=type x;.z.s each x;x]}dict.a
24
q)sum {raze $[99h=type x;.z.s each x;x]}@/[dict;`a`e]
14
Obviously these can all be wrapped up into nice neat functions if necessary.
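For readers who do not use q, the same nested structure and recursive sum can be sketched in Python. This is only an illustration of the idea, not kdb+ code; the `value`/`children` layout and the helper names `find` and `subtree_sum` are assumptions made for this example.

```python
# Each node stores its own value plus a dict of child nodes.
tree = {
    "a": {"value": 4, "children": {
        "b": {"value": 3, "children": {
            "c": {"value": 1, "children": {}},
            "d": {"value": 2, "children": {}},
        }},
        "e": {"value": 7, "children": {
            "f": {"value": 5, "children": {
                "g": {"value": 2, "children": {}},
            }},
        }},
    }},
}

def find(nodes, path):
    """Walk from the top-level dict down to the node named by path."""
    node = nodes[path[0]]
    for name in path[1:]:
        node = node["children"][name]
    return node

def subtree_sum(node):
    """Sum a node's value and the values of all its descendants."""
    return node["value"] + sum(subtree_sum(child) for child in node["children"].values())

print(subtree_sum(find(tree, ["a"])))       # 24
print(subtree_sum(find(tree, ["a", "e"])))  # 14
```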
IMO it depends on how you plan to "query" the tree, and what you want out of it - can you elaborate further? Do you want sub-trees? E.g. if you query for "a" will it give you "4" OR the whole sub-tree (in this case it's the whole tree).
If memory isn't an issue, or your tree is small, you could have either a nested dictionary of dictionaries, with a function that recurses inside the dictionaries, or you could have the letter symbols as the key of a single dictionary:
q)d:(enlist enlist `a)!enlist 4
q)d,:(enlist `a`b)!enlist 3
q)d,:(enlist `a`b`c)!enlist 1
q)d
,`a | 4
`a`b | 3
`a`b`c| 1
q)d`a
4
q)d`a`b
3
q)d`a`b`c
1
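The single-dictionary layout also makes subtree sums straightforward: every node under a given path shares that path as a prefix of its key. A rough Python equivalent is shown below; it is an illustration rather than q, the entries beyond `a`, `a b` and `a b c` are filled in from the tree in the question, and `sum_under` is a hypothetical helper name.

```python
# Keys are full paths from the root; values are the node values.
values = {
    ("a",): 4,
    ("a", "b"): 3,
    ("a", "b", "c"): 1,
    ("a", "b", "d"): 2,
    ("a", "e"): 7,
    ("a", "e", "f"): 5,
    ("a", "e", "f", "g"): 2,
}

def sum_under(values, prefix):
    """Sum the node at prefix and every node whose path starts with prefix."""
    prefix = tuple(prefix)
    return sum(v for path, v in values.items() if path[:len(prefix)] == prefix)

print(values[("a", "b")])               # 3
print(sum_under(values, ["a"]))         # 24
print(sum_under(values, ["a", "e"]))    # 14
```

The trade-off is that every subtree query scans all keys, which is fine for a small tree but worth reconsidering for a large one.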

How to pick items from warehouse to minimise travel in TSQL?

I am looking at this problem from a TSQL point of view, however any advice would be appreciated.
Scenario
I have 2 sets of criteria which identify items in a warehouse to be selected.
Query 1 returns 100 items
Query 2 returns 100 items
I need to pick any 25 of the 100 items returned in query 1.
I need to pick any 25 of the 100 items returned in query 2.
- The items in query 1/2 will not be the same, ever.
Each item is stored in a segment of the warehouse.
A segment of the warehouse may contain numerous items.
I wish to select the 50 items (25 from each query) in a way as to reduce the number of segments I must visit to select the items.
Suggested Approach
My initial idea has been to combine the 2 result sets and produce a list of:
Segment ID, NumberOfItemsRequiredInSegment
I would then select 25 items from each query, giving preference to those in segments with the highest NumberOfItemsRequiredInSegment. I know this would not be optimal, but it would be an easy-to-implement heuristic.
Questions
1) I suspect this is a standard combinatorial problem, but I don't recognise it. Perhaps multiple knapsack? Does anyone recognise it?
2) Is there a better (easy-ish to implement) heuristic or solution - ideally in TSQL?
Many thanks.
This might also not be optimal, but I think it would at least perform fairly well.
Calculate this set for query 1:
Segment ID, NumberOfItemsRequiredInSegment
Take the top 25, just by sorting by NumberOfItemsRequiredInSegment. Call this subset A.
Take the top 25 from query 2, by joining to A and sorting by "case when A.segmentID is not null then 1 else 0 end desc, NumberOfItemsRequiredInSegmentFromQuery2 desc".
Repeat this, but take the top 25 from query 2 first. Return the better performing of the 2 sets.
The one scenario where I think this fails would be if you got something like this:
Segment  Count Query 1  Count Query 2
A        10             1
B        5              1
C        5              1
D        5              4
E        5              4
F        4              4
G        4              5
H        1              5
J        1              5
K        1              10
You need to make sure you choose A, D, and E when choosing the best segments from query 1. To deal with this you'd almost still need to join to query 2, so you can get the count from there to use as a tie breaker.
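The two-pass heuristic, including the tie-breaker on the other query's count, can be sketched as follows. This is a Python illustration of the logic rather than T-SQL; the function names `pick` and `best_pick`, the `(item_id, segment)` tuple layout and the tiny sample data are all assumptions for this example, and the result is a heuristic, not a guaranteed optimum.

```python
from collections import Counter

def pick(primary, secondary, k):
    """Pick k items from primary by segment density, then k from secondary,
    preferring segments that are already being visited."""
    p_count = Counter(seg for _, seg in primary)
    s_count = Counter(seg for _, seg in secondary)
    # Densest segments first; ties broken by the other query's count.
    first = sorted(primary, key=lambda it: (-p_count[it[1]], -s_count[it[1]], it[1], it[0]))[:k]
    visited = {seg for _, seg in first}
    # Already-visited segments first, then densest segments of the secondary query.
    second = sorted(secondary, key=lambda it: (it[1] not in visited, -s_count[it[1]], it[1], it[0]))[:k]
    return first, second, visited | {seg for _, seg in second}

def best_pick(q1, q2, k):
    """Run the heuristic in both orders and keep the run visiting fewer segments."""
    a1, a2, seg_a = pick(q1, q2, k)
    b2, b1, seg_b = pick(q2, q1, k)
    return (a1, a2, seg_a) if len(seg_a) <= len(seg_b) else (b1, b2, seg_b)

# Tiny made-up example: items are (item_id, segment) pairs.
q1 = [(1, "A"), (2, "A"), (3, "B"), (4, "C")]
q2 = [(10, "B"), (11, "C"), (12, "C"), (13, "D")]
from_q1, from_q2, segments = best_pick(q1, q2, 2)
print(from_q1, from_q2, sorted(segments))
```

In T-SQL the same ordering could be expressed with the per-segment counts joined in and used in the ORDER BY of a TOP query, as described above.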

Calculating change in leaders for baseball stats in MSSQL

Imagine I have an MSSQL 2005 table (bbstats) that updates weekly, showing various cumulative categories of baseball accomplishments for a team:
week 1
Player H SO HR
Sammy 7 11 2
Ted 14 3 0
Arthur 2 15 0
Zach 9 14 3
week 2
Player H SO HR
Sammy 12 16 4
Ted 21 7 1
Arthur 3 18 0
Zach 12 18 3
I wish to highlight textually where there has been a change in leader for each category.
So after week 2 there would be nothing to report on hits (H); Zach has joined Arthur with most strikeouts (SO) at 18; and Sammy is the new leader in home runs (HR) with 4.
So I would want to set up a process something like:
a) save the past data (week 1) as table bbstatsPrior,
b) update bbstats with the new results - I do not need assistance with this
c) compare the tables to find the player(s), including ties, with the max value for each column, and output only the columns where they differ
d) move on to the next column and repeat
In any real world example there would be significantly more columns to calculate for
Thanks
Responding to Brent's comments, I am really after any changes in the leaders for each category.
So I would have something like
select top 1 with ties player
from bbstatsPrior
order by H desc
and
select top 1 with ties player,H
from bbstats
order by H desc
I then want to compare the players from each query (do I need temp tables?). If they differ, I want to output the second select statement. For the H category Ted is the leader in both tables, but for other categories there are changes between the weeks.
I can then loop through the columns using
select name from sys.all_columns sc
where sc.object_id=object_id('bbstats') and name <>'player'
If the number of stats doesn't change often, you could easily just write a single query to get this data. Join bbStats to bbStatsPrior where bbstatsprior.week < bbstats.week and bbstats.week=@weekNumber. Then just do a simple comparison between bbstats.Hits and bbStatsPrior.Hits to get your difference.
If the stats change often, you could use dynamic SQL to do this for all columns that match a certain pattern, or that are in a list of columns based on sys.columns for that table.
You could add a column for each stat column to designate the leader using a correlated subquery to find the max value for that column and see if it's equal to the current record.
This might get you started, but I'd recommend posting what you currently have to achieve this and the community can help you from there.
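As a starting point for the comparison logic itself, here is a small Python sketch using the two weeks of sample data from the question. It is not T-SQL and not a drop-in solution; the function names `leaders` and `leader_changes` are made up for this example, and ties are handled by comparing the full set of leading players.

```python
week1 = {
    "Sammy":  {"H": 7,  "SO": 11, "HR": 2},
    "Ted":    {"H": 14, "SO": 3,  "HR": 0},
    "Arthur": {"H": 2,  "SO": 15, "HR": 0},
    "Zach":   {"H": 9,  "SO": 14, "HR": 3},
}
week2 = {
    "Sammy":  {"H": 12, "SO": 16, "HR": 4},
    "Ted":    {"H": 21, "SO": 7,  "HR": 1},
    "Arthur": {"H": 3,  "SO": 18, "HR": 0},
    "Zach":   {"H": 12, "SO": 18, "HR": 3},
}

def leaders(stats, category):
    """Return the sorted list of players tied for the max, and that max value."""
    best = max(row[category] for row in stats.values())
    return sorted(p for p, row in stats.items() if row[category] == best), best

def leader_changes(prior, current, categories):
    """Report only the categories whose set of leaders differs between weeks."""
    changes = {}
    for category in categories:
        old_players, _ = leaders(prior, category)
        new_players, new_best = leaders(current, category)
        if old_players != new_players:
            changes[category] = (new_players, new_best)
    return changes

print(leader_changes(week1, week2, ["H", "SO", "HR"]))
# {'SO': (['Arthur', 'Zach'], 18), 'HR': (['Sammy'], 4)}
```

Hits are not reported because Ted leads in both weeks, which matches the expected output described in the question.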