Is it possible to return only one instance of each ID in a view? - postgresql

I am trying to work out how I would ensure I only get one instance of each user and their ID when I try to do an inner join on my source table.
The source data is a series of user names and IDs
userid username
1 alice
2 bob
3 charley
4 dave
5 robin
6 jon
7 lou
8 scott
I have had to write the matching algorithm in Python to make sure each user's data is matched with another's only once per round.
We store the pairings, and how many rounds have completed successfully after the tests, but I'd like to try and shortcut the process.
I can get all the candidate results, but I want a better way to remove each matched pair from the results so they can't be matched again.
select u.user_id, u.user, ur.user_id, ur.user
from userResults u
inner join userResults ur
  on u.user_id < ur.user_id
 and (u.user_id, ur.user_id) not in (select uid, uid2 from rounds)
where u.match <= ur.match
  and u.user_id not in %s
  and ur.user_id not in %s
limit 1;
I've tried making materialised views with a unique constraint, but it doesn't seem to have any effect - I get each possible pairing once, rather than each user paired only once.
I'm trying to work out how I only get 4 results, in the right order.
Every time I look at the underlying code, I can't help but think there's a better way to write it natively in SQL rather than having to iterate over results in python.
edit
Assuming each user has been matched 0 or more times, you might have a situation where user_ids 1-4 have rounds set to 1 and matches set to 1, and the remaining 4 have rounds set to 1 and no matches.
I have a view which will return a default value of 0 and 0 for rounds and matches if they haven't yet played, and you can't assume all rounds entered have met with a match.
If the first 4 have all matched, and have generated rounds, user 1 and user 2 have already met and matched in a round, so they won't be matched again, so user 1 will match user 3 (or 4) and user 2 will match user 4 (or 3)
The issue I'm having is that when I remove limit, and iterate through manually - the first three matches I always get are: 2,4 then 1,3, then 2,3 (rather than 5,7 or 6,8)
Adding the sample data and current rounds
table rounds
uid uid2
1 2
3 4
userresults view
user_id user rounds score
1 alice 1 0
2 bob 1 1
3 charley 1 1
4 dave 1 0
5 robin 0 0
6 jon 0 0
7 lou 0 0
8 scott 0 0
I'm currently getting results like:
2,4
2,3
1,3
1,4
4,6
...
These are all valid results, but I would like to limit them to a single instance of each ID in each column, so just the first match of each valid pairing.
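The filtering the application code presumably performs can be sketched as a single greedy pass over the candidate pairs in order, skipping any pair whose members are already taken. A minimal Python sketch (the candidate list below is illustrative, built from the sample output above plus the remaining unmatched users):

```python
def pick_unique_pairs(candidates):
    """Keep the first pair for each user; skip pairs that reuse a taken ID."""
    taken = set()
    picked = []
    for a, b in candidates:
        if a not in taken and b not in taken:
            picked.append((a, b))
            taken.update((a, b))
    return picked

# Candidate pairs in the order a query might return them
candidates = [(2, 4), (2, 3), (1, 3), (1, 4), (5, 7), (5, 8), (6, 8)]
print(pick_unique_pairs(candidates))  # [(2, 4), (1, 3), (5, 7), (6, 8)]
```

This is exactly the "each ID appears at most once in each column" restriction: once a user is taken, every later pair containing them is discarded.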
I've created a new view to try and simplify things a bit, and populated it with dummy data and tried to generate matches.
All these matches are valid, and I'm trying to add some form of filter or restriction to bring it back to sensible numbers.
777;"Bongo Steveson";779;"Drongo Mongoson"
777;"Bongo Steveson";782;"Cheesey McHamburger"
777;"Bongo Steveson";780;"Buns Mattburger"
779;"Drongo Mongoson";782;"Cheesey McHamburger"
779;"Drongo Mongoson";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";777;"Bongo Steveson"
778;"Mongo Bongoson";779;"Drongo Mongoson"
775;"Bob Jobsworth";778;"Mongo Bongoson"
778;"Mongo Bongoson";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";782;"Cheesey McHamburger"
775;"Bob Jobsworth";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";780;"Buns Mattburger"
776;"Steve Bobson";777;"Bongo Steveson"
776;"Steve Bobson";779;"Drongo Mongoson"
776;"Steve Bobson";782;"Cheesey McHamburger"
776;"Steve Bobson";778;"Mongo Bongoson"
776;"Steve Bobson";781;"Hamburgler Bunburger"
780;"Buns Mattburger";782;"Cheesey McHamburger"
780;"Buns Mattburger";781;"Hamburgler Bunburger"
I still can't work out a sensible way to restrict these values, and it's driving me nuts.
I've implemented a solution in code but I'd really like to see if I can get this working in native Postgres.
At this point I'm monkeying around with a new test database schema, and this is my view - adding UNIQUE to the index generates an error, and I can't add a CHECK constraint to a materialised view (grrrr).

You can try joining a subquery to ensure distinct records from the user table.
select *
from any_table t1
inner join (
    select distinct userid, username from source_table
) t2 on t1.userid = t2.userid;
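For illustration, the same distinct-subquery pattern run against an in-memory SQLite database (the table and column contents here are made up; note this deduplicates the user table, which is a different restriction from "each user matched only once"):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (userid INTEGER, username TEXT);
    INSERT INTO source_table VALUES (1,'alice'),(1,'alice'),(2,'bob');
    CREATE TABLE any_table (userid INTEGER, score INTEGER);
    INSERT INTO any_table VALUES (1, 10), (2, 20);
""")
rows = conn.execute("""
    SELECT t1.userid, t2.username, t1.score
    FROM any_table t1
    JOIN (SELECT DISTINCT userid, username FROM source_table) t2
      ON t1.userid = t2.userid
    ORDER BY t1.userid
""").fetchall()
print(rows)  # [(1, 'alice', 10), (2, 'bob', 20)] - the duplicate alice row is collapsed
```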

Related

How to for loop in Tableau when you have a combined primary key

Hi everybody, and thanks in advance to whoever answers my question:
I think I need a for loop in Tableau and am looking for a workaround;
my table has the following structure
id, id_detail, result
1 1 fail
1 2 pass
1 3 pass
2 1 pass
2 2 pass
3 1 fail
3 2 pass
...
...
I need to assign to id=1 fail; id=2 pass; id=3 fail
Do you have any suggestions?
You may use alphabetical order, where 'pass' is greater than 'fail'.
create this calculated field
{FIXED [Product ID]: MIN([Test Evaluation])}
and you'll get what you want
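The trick relies on 'fail' sorting before 'pass' alphabetically, so a MIN over each group surfaces a failure whenever any detail row failed. The same idea in plain Python, using the sample rows from the question:

```python
# Detail results grouped by id, from the sample table
results = {
    1: ["fail", "pass", "pass"],
    2: ["pass", "pass"],
    3: ["fail", "pass"],
}

# "fail" < "pass" lexicographically, so min() returns "fail"
# if any detail row for that id failed
overall = {test_id: min(outcomes) for test_id, outcomes in results.items()}
print(overall)  # {1: 'fail', 2: 'pass', 3: 'fail'}
```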

PySpark - if ID exists in Table 1 & Table 2, fill column in table 1 with TRUE else FALSE

I am aiming to merge two tables together, on ID. Some of table 2 won't have all the ID's of table 1 and that's OK - in those cases I would like to fill a 3rd column with true or false values based on whether the IDs existed in both tables.
Let me try to articulate my problem.
Let's say I have two tables:
Table 1:          Table 2:
ID  TIME          ID  TIME
1   1/1/21        2   1/1/21
4   1/1/21        4   1/1/21
7   1/1/21        8   1/1/21
The time table is not important, what is important is ID 4. I would like to fill ID 4 with whether it exists in both tables. The final output would be like so, referencing back to table 1.
ID TIME EXISTS_BOTH_TABLES
1 1/1/21 FALSE
4 1/1/21 TRUE
7 1/1/21 FALSE
I realize this might be a particular type of Join, but my struggle exists also in articulating what exactly I need. I hope this helps you understand my issue.
I haven't tested the following code, but you should do something along these lines:
from pyspark.sql.functions import col, lit
flags = df2.select('ID').distinct().withColumn('in_df2', lit(True))
df1.join(flags, 'ID', 'left').withColumn('EXISTS_BOTH_TABLES', col('in_df2').isNotNull()).drop('in_df2')
Left join on the ID and flag each row by whether it found a match on the right-hand side: no match means the ID only exists in table 1, so the column is False, else True. (Joining on a small flag column avoids the ambiguous-column problem you can hit when referencing df2.ID after the join.)
If you want to check both directions, you could do a full join instead and detect null values on both sides.
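Stripped of Spark, the underlying left-join-with-flag semantics over the sample IDs (table 2 contains IDs 2, 4 and 8) look like this:

```python
# Table 1 rows as (ID, TIME); table 2 reduced to its set of IDs
table1 = [(1, "1/1/21"), (4, "1/1/21"), (7, "1/1/21")]
table2_ids = {2, 4, 8}

# Left-join semantics: keep every table-1 row, flag whether the ID
# also appears in table 2
result = [(i, t, i in table2_ids) for i, t in table1]
print(result)  # [(1, '1/1/21', False), (4, '1/1/21', True), (7, '1/1/21', False)]
```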
EDIT: I believe you've made a mistake in your example - in the result table, ID 7 should be True.

How to pick items from warehouse to minimise travel in TSQL?

I am looking at this problem from a TSQL point of view, however any advice would be appreciated.
Scenario
I have 2 sets of criteria which identify items in a warehouse to be selected.
Query 1 returns 100 items
Query 2 returns 100 items
I need to pick any 25 of the 100 items returned in query 1.
I need to pick any 25 of the 100 items returned in query 2.
- The items in query 1/2 will not be the same, ever.
Each item is stored in a segment of the warehouse.
A segment of the warehouse may contain numerous items.
I wish to select the 50 items (25 from each query) in a way as to reduce the number of segments I must visit to select the items.
Suggested Approach
My initial idea has been to combined the 2 result sets and produce a list of
Segment ID, NumberOfItemsRequiredInSegment
I would then select 25 items from each query, giving preference to those in a segments with the most NumberOfItemsRequiredInSegment. I know this would not be optimal but would be an easy to implement heuristic.
Questions
1) I suspect this is a standard combinatorial problem, but I don't recognise it... perhaps multiple knapsack? Does anyone recognise it?
2) Is there a better (easy-ish to implement) heuristic or solution - ideally in TSQL?
Many thanks.
This might also not be optimal, but I think it would at least perform fairly well.
Calculate this set for query 1.
Segment ID, NumberOfItemsRequiredInSegment
Take the top 25, just by sorting by NumberOfItemsRequiredInSegment; call this subset A.
Take the top 25 from query 2, by joining to A and sorting by "case when A.segmentID is not null then 1 else 0 end, NumberOfItemsRequiredInSegmentFromQuery2".
Repeat this, but take the top 25 from query 2 first, and return the better-performing of the two sets.
The one scenario where I think this fails would be if you got something like this:
Segment Count Query 1 Count Query 2
A 10 1
B 5 1
C 5 1
D 5 4
E 5 4
F 4 4
G 4 5
H 1 5
J 1 5
K 1 10
You need to make sure you choose A, D, and E when choosing the best segments from query 1. To deal with this, you'd almost still need to join to query 2, so you can use its counts as a tie-breaker.
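The greedy "prefer the fullest segments" idea can be sketched as follows. This is Python rather than T-SQL, purely to make the logic concrete, and the data shape is an assumption: each query is taken to return (item_id, segment_id) rows.

```python
from collections import Counter

def pick_items(rows, n):
    """Pick n items, preferring items whose segment holds the most
    candidate items (greedy heuristic, not guaranteed optimal)."""
    per_segment = Counter(seg for _, seg in rows)
    # Stable sort: densest segments first, original order within a segment
    ranked = sorted(rows, key=lambda r: -per_segment[r[1]])
    return ranked[:n]

# Hypothetical query-1 rows: (item_id, segment_id)
query1 = [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "C")]
picked = pick_items(query1, 3)
print(picked)  # all three segment-A items: one segment visited instead of three
```

In SQL this corresponds to ordering by a per-segment COUNT(*) window/aggregate and taking TOP 25; the tie-breaker join against the other query's counts (as in the scenario above) would add a second sort key.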

Calculating change in leaders for baseball stats in MSSQL

Imagine I have a MSSQL 2005 table(bbstats) that updates weekly showing
various cumulative categories of baseball accomplishments for a team
week 1
Player H SO HR
Sammy 7 11 2
Ted 14 3 0
Arthur 2 15 0
Zach 9 14 3
week 2
Player H SO HR
Sammy 12 16 4
Ted 21 7 1
Arthur 3 18 0
Zach 12 18 3
I wish to highlight textually where there has been a change in leader for each category
so after week 2 there would be nothing to report for hits (H); Zach has joined Arthur with the most strikeouts (SO) at 18; and Sammy is the new leader in home runs (HR) with 4.
So I would want to set up a process something like
a) save the past data(week 1) as table bbstatsPrior,
b) updates the bbstats for the new results - I do not need assistance with this
c) compare between the tables for the player(s with ties) with max value for each column
and spits out only where they differ
d) move onto next column and repeat
In any real world example there would be significantly more columns to calculate for
Thanks
Responding to Brent's comments, I am really after any changes in the leaders for each category.
So I would have something like
select top 1 with ties player
from bbstatsPrior
order by H desc
and
select top 1 with ties player,H
from bbstats
order by H desc
I then want to compare the player from each query (do I need to use temp tables?). If they differ, I want to output the second select statement. For the H category, Ted is the leader from both tables, but for the other categories there are changes between the weeks.
I can then loop through the columns using
select name from sys.all_columns sc
where sc.object_id=object_id('bbstats') and name <>'player'
If the number of stats doesn't change often, you could easily just write a single query to get this data. Join bbStats to bbStatsPrior where bbstatsprior.week < bbstats.week and bbstats.week=#weekNumber. Then just do a simple comparison between bbstats.Hits to bbstatsPrior.Hits to get your difference.
If the stats change often, you could use dynamic SQL to do this for all columns that match a certain pattern or are in a list of columns based on sys.columns for that table?
You could add a column for each stat column to designate the leader using a correlated subquery to find the max value for that column and see if it's equal to the current record.
This might get you started, but I'd recommend posting what you currently have to achieve this and the community can help you from there.
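To make the comparison step (c) concrete, here is a sketch of the leader-diff logic (in Python rather than T-SQL, purely to illustrate; it handles ties by comparing the *sets* of leaders, matching the "top 1 with ties" approach) using the week 1 and week 2 sample data:

```python
week1 = {"Sammy": {"H": 7, "SO": 11, "HR": 2},
         "Ted": {"H": 14, "SO": 3, "HR": 0},
         "Arthur": {"H": 2, "SO": 15, "HR": 0},
         "Zach": {"H": 9, "SO": 14, "HR": 3}}
week2 = {"Sammy": {"H": 12, "SO": 16, "HR": 4},
         "Ted": {"H": 21, "SO": 7, "HR": 1},
         "Arthur": {"H": 3, "SO": 18, "HR": 0},
         "Zach": {"H": 12, "SO": 18, "HR": 3}}

def leaders(stats, cat):
    """Set of players sharing the max value for a category (handles ties)."""
    top = max(v[cat] for v in stats.values())
    return {p for p, v in stats.items() if v[cat] == top}

# Report only the categories whose leader set changed between weeks
changes = {cat: leaders(week2, cat)
           for cat in ("H", "SO", "HR")
           if leaders(week1, cat) != leaders(week2, cat)}
# SO and HR leaders changed (Zach joined Arthur; Sammy took HR); H did not
print(changes)
```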

Aggregate path counts using HierarchyID

Business problem - understand process fallout using analytics data.
Here is what we have done so far:
Build a dictionary table with every possible process step
Find each process "start"
Find the last step for each start
Join dictionary table to last step to find path to final step
In the final report output we end up with a list of paths for each start to each final step:
User Fallout Step HierarchyID.ToString()
A 1/1/1
B 1/1/1/1/1
C 1/1/1/1
D 1/1/1
E 1/1
What this means is that five users (A-E) started the process. Assume only User B finished, the other four did not. Since this is a simple example (without branching) we want the output to look as follows:
Step Unique Users
1 5
2 5
3 4
4 2
5 1
The easiest solution I could think of is to take each hierarchyID.ToString(), parse that out into a set of subpaths, JOIN back to the dictionary table, and output using GROUP BY.
Given the volume of data, I'd like to use the built-in HierarchyID functions, e.g. IsAncestorOf.
Any ideas or thoughts how I could write this? Maybe a recursive CTE?
Restructuring the data may help with this. For example, structuring the data like this:
User Step Process#
---- ---- --------
A 1 1
A 2 1
A 3 1
B 1 2
B 2 2
B 3 2
B 4 2
B 5 2
E 1 3
E 2 3
E 1 4
E 2 4
E 3 4
Allows you to run the following query:
select step,
count(distinct process#) as process_iterations,
count(distinct user) as unique_users
from stepdata
group by step
order by step;
which returns:
Step Process_Iterations Unique_Users
---- ------------------ ------------
1 4 3
2 4 3
3 3 3
4 1 1
5 1 1
I'm not familiar with hierarchyid, but splitting out that data into chunks for analysis looks like the sort of problem numbers tables are very good for. Join a numbers table against the individual substrings in the fallout and it shouldn't be too hard to treat the whole thing as a table and analyse it on the fly, without any non-set operations.
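The prefix-expansion idea from the question can be sketched outside SQL: split each path, and count a user at step n whenever their path is at least n levels deep (this assumes the linear, non-branching paths of the example). Using the sample data:

```python
from collections import Counter

# HierarchyID paths per user, from the sample output
paths = {"A": "1/1/1", "B": "1/1/1/1/1", "C": "1/1/1/1",
         "D": "1/1/1", "E": "1/1"}

step_users = Counter()
for user, path in paths.items():
    depth = len(path.split("/"))
    # Each prefix of the path is a step the user reached
    for step in range(1, depth + 1):
        step_users[step] += 1

for step in sorted(step_users):
    print(step, step_users[step])
# 1 5 / 2 5 / 3 4 / 4 2 / 5 1 - matching the desired report
```

The numbers-table suggestion above is the set-based SQL equivalent of this loop: join each path against the numbers 1..depth and GROUP BY the step.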