Recursive CTE or self joins with circular use case - tsql

I have two tables:
CompanyCases is a table of companies and case numbers like this:
CompanyId, CaseNumber
CaseRelations is a table of cases and their related cases like this:
CaseNumber, RelatedCase
There can be several companies related to one case, and one case can be related to several companies.
I need a query that will give me all related cases for a company id. The trick is that when I find a related case, that case can itself have a related case, which can have another related case, and so on.
My first assumption was that it would not be that deep, so I could just do self joins like:
Select
cc.CompanyId,
cc.CaseNumber,
CR1.CaseNumber,
CR1.RelatedCase,
CR2.CaseNumber,
CR2.RelatedCase
FROM CompanyCases cc
LEFT JOIN CaseRelations CR1 ON CR1.CaseNumber = cc.CaseNumber
LEFT JOIN CaseRelations CR2 ON CR2.CaseNumber = CR1.RelatedCase
And then keep joining as many levels as is needed. The problem is that the cases loop. So it can go like this:
CaseNumber RelatedCase
1 2
2 3
3 1
So I can keep joining forever without ever reaching a full column of nulls. Also, the chains are at least 5 levels deep, so this is not a great solution anyway. I don't mind using recursive CTEs either, but I think I will hit the same problem with the circular cases.
I hope I described it well enough. Does anyone know how to solve this?
Thanks in advance :)
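One way to break the loop is to carry the list of already-visited cases along in a recursive CTE and refuse to revisit them. Here's a minimal T-SQL sketch, assuming integer case numbers, a hypothetical @CompanyId parameter, and relations stored in one direction:

DECLARE @CompanyId INT = 1;  -- hypothetical parameter

WITH CaseTree AS
(
    -- Anchor: the company's cases and their direct relations
    SELECT
        cc.CompanyId,
        cr.CaseNumber,
        cr.RelatedCase,
        CAST(',' + CAST(cr.CaseNumber AS VARCHAR(20)) + ',' AS VARCHAR(MAX)) AS VisitedPath
    FROM CompanyCases cc
    JOIN CaseRelations cr ON cr.CaseNumber = cc.CaseNumber
    WHERE cc.CompanyId = @CompanyId

    UNION ALL

    -- Recursive step: follow the chain, but skip any case already on the path
    SELECT
        ct.CompanyId,
        cr.CaseNumber,
        cr.RelatedCase,
        ct.VisitedPath + CAST(cr.CaseNumber AS VARCHAR(20)) + ','
    FROM CaseTree ct
    JOIN CaseRelations cr ON cr.CaseNumber = ct.RelatedCase
    WHERE ct.VisitedPath NOT LIKE '%,' + CAST(cr.CaseNumber AS VARCHAR(20)) + ',%'
)
SELECT DISTINCT CompanyId, RelatedCase
FROM CaseTree;

The VisitedPath column is what breaks the 1 -> 2 -> 3 -> 1 loop: once a case number already appears in the path, the recursive member filters that row out, so the recursion terminates on its own instead of hitting the MAXRECURSION limit.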

Design question for a table with too many joins OR polymorphic relations in Postgres 11.7

I've been given a table that I'm not sure how to design. I'm hoping for some design suggestions, or pointers in the right direction. The table is called edge and is meant to store some event traces, and IDs that link out to a host of possible lookup tables. Leaving out everything but IDs, here's what the table contains, all UUIDs:
ID
InvID
OrgID
FacilityID
FromAssemblyID
FromAssociatedTo
FromAssociatedToID
FromClinicID
FromFacilityDepartmentID
FromFacilityID
FromFacilityLocationID
FromScanAtFacilityID
FromScanID
FromSCaseID
FromSterilizerLoadID
FromWasherLoadID
FromWebUserID
ToAssemblyID
ToAssociatedTo
ToAssociatedToID
ToClinicID
ToFacilityDepartmentID
ToFacilityID
ToFacilityLocationID
ToNodeDTS
ToScanAtFacilityID
ToScanID
ToSCaseID
ToSterilizerLoadID
ToUserName
ToWasherLoadID
ToWebUserID
That's an overwhelming number of IDs to possibly join on. I remember reading that the Postgres planner kind of gives up when you've got a dozen-plus joins, the idea being that there are so many permutations to explore that the planning time could quickly overwhelm the query time. If you boil it down, the "from" and "to" links are only ever going to have one key value across all of those fields. So, implemented as polymorphic/promiscuous relations, something like this:
ID
InvID
OrgID
FacilityID
FromID
FromType
ToID
ToType
ToWebUserID
This table is going to be ginormous, so speed is/will be a consideration.
I encouraged the author not to use a polymorphic design, although the appeal is obvious. (I like Karwin's SQL Antipatterns book.) But now, confronted with nearly three dozen IDs, I'm a bit stumped.
Is there a common solution to this kind of problem? Namely, where you've got a central table like this with connections to a wide variety of possible tables? I don't have a Data Warehousing background, but this looks somewhat like that. (The author of this table has read Kimball's books, but not done any Data Warehouse implementations either.)
Important: We're using JOIN to do lookups on related values that might change, we're not using it to change the size of the result set. Just pretend it would always be LEFT JOIN.
With that in mind, what I've thought of is to skip joining on the From and To IDs, and instead use custom function calls to look up required values from the related tables, like this (pseudo-code):
GetUserName(uuid) : citext
...and so on for other values of interest in this and other tables...
The function would return '' when the UUID is the all-zeros placeholder.
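A minimal sketch of what such a lookup function could look like, assuming the citext extension is installed and a hypothetical web_user table with id and user_name columns:

CREATE FUNCTION get_user_name(p_id uuid) RETURNS citext AS $$
    -- The scalar subquery yields NULL when nothing matches (e.g. the
    -- all-zeros UUID has no row), and COALESCE turns that into ''.
    SELECT COALESCE(
        (SELECT u.user_name FROM web_user u WHERE u.id = p_id),
        ''::citext
    );
$$ LANGUAGE sql STABLE;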
I appreciate that this isn't the crispest question in the history of SO; what I'm hoping for is pointers in a fruitful direction.
This smacks of “premature optimization” (which is a source of evil) based on something that you “remember reading”, so maybe some enlightenment about join optimization will help.
One rule of thumb that I follow in questions like this is to model things so that your queries become simple and natural. Experience shows that that often leads to good performance.
I assume that the table you show is the fact table of a star schema, and the foreign keys point to the many dimension tables, so that your query will look like
SELECT ...
FROM fact
JOIN dim1 ON fact.dim1_id = dim1.id
JOIN dim2 ON fact.dim2_id = dim2.id
JOIN dim3 ON fact.dim3_id = dim3.id
...
WHERE dim1.col1 = ...
AND dim2.col2 BETWEEN ... AND ...
AND dim3.col3 < ...
...
Now PostgreSQL will by default only consider all join permutations of the first eight tables (join_collapse_limit), and the rest of the tables are just joined in the order in which they appear in the query.
Moreover, if the number of tables reaches the threshold of 12 (geqo_threshold), the genetic query optimizer takes over, a component that simulates evolution by mutation and survival of the fittest with randomly chosen execution plans (really!) and consequently doesn't always come up with the same execution plan for the same query.
So my advice would be to write the queries in such a way that the first seven dimension tables are the ones with the biggest chance of reducing the number of result rows most significantly (based on the WHERE conditions). You can also increase join_collapse_limit; since your queries take a long time to run anyway, you can easily afford to let the planner spend more time finding the best plan.
Then you'd set geqo = off to disable the genetic query optimizer.
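At the session level that boils down to something like this (join_collapse_limit and geqo are real PostgreSQL settings; the exact limit here is illustrative, tune it to your queries):

-- Consider join reorderings across more tables than the default 8
SET join_collapse_limit = 20;
-- Keep plans deterministic by disabling the genetic query optimizer
SET geqo = off;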
If you design your queries according to these principles, you should be able to get good execution plans without messing up the data model.

JOIN versus subquery - performance sqlite3

So I've got 3 tables:
Locations (LocationID [PK])
Track_Locations (TrackID [FK], LocationID [FK])
Tracks (TrackID [PK])
I'm building an app for iPhone and I was told that a JOIN would probably work faster than a subquery would. Is this true?
I have searched around but can't find a straight answer to this and would like to know what is best to use in this case.
My queries in both cases would look like this:
Subquery:
SELECT LocationID, TimestampGPS, Longitude, Latitude, ...
FROM locations
WHERE LocationID
IN (SELECT LocationID FROM track_locations WHERE TrackID = ?)
Using JOIN:
SELECT l.LocationID, l.TimestampGPS, l.Longitude, l.Latitude, ...
FROM locations l
JOIN track_locations tl
ON (l.LocationID = tl.LocationID)
JOIN tracks t
ON (t.TrackID = tl.TrackID)
WHERE t.TrackID = ?
A JOIN is almost always faster than a subquery, because it results in a single query, whereas a subquery is two queries. Some databases, however, know how to optimize such simple subqueries into JOINs, though I doubt SQLite does.
That said, sometimes, given the structure of one's data, it is faster to do a subquery. The only real way to find out is to load your database with a likely collection of sample data from the wild and benchmark both approaches.
My suspicion is that it will be a wash, however. Most iPhone apps don't have enough rows in the database to make much difference.
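If you do benchmark, SQLite's EXPLAIN QUERY PLAN is a cheap first look at what the engine intends to do with each form. For example, for the join variant (with a placeholder TrackID in place of the ? parameter):

EXPLAIN QUERY PLAN
SELECT l.LocationID, l.TimestampGPS, l.Longitude, l.Latitude
FROM locations l
JOIN track_locations tl
ON (l.LocationID = tl.LocationID)
JOIN tracks t
ON (t.TrackID = tl.TrackID)
WHERE t.TrackID = 1;

Running the same prefix on the subquery form shows whether SQLite flattens it into a join or executes it as a separate scan.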

tsql join best practices advice

First of all, English is not my first language, so feel free to edit my question; I'm sorry for any mistakes and for not exposing the problem more clearly.
I have a few SQL queries with lots of joins. These joins are based on clustered indexes (no worries about that). Some of the joins are used only to respect normalization and because it is intuitive for maintenance, but sometimes it's possible to skip some of them. It's not clear to me what to do about these joins in terms of best practices.
Edit:
A simple example:
select *
from things
join things_categories on
things_categories.id_thing = things.id_thing
join categories on
categories.id_category = things_categories.id_category
join categories_properties on
categories_properties.id_category = categories.id_category
where
categories_properties.bo_default = 1
But it's possible to do:
select *
from things
join things_categories on
things_categories.id_thing = things.id_thing
join categories_properties on
categories_properties.id_category = things_categories.id_category
where
categories_properties.bo_default = 1
The second join is not necessary (I do have integrity at the database level); it's there only because it makes the code more intuitive and respects the database normalization. I'm not sure if I should follow the smallest, most efficient path, or leave unnecessary joins in to respect normalization and keep the code more intuitive.
Any tips?
All the best.
It depends on whether you already have integrity.
On the one hand, if the categories_properties table has a foreign key on the id_category column, then the integrity exists and you don't need to make the join with the categories table.
On the other hand, if the integrity might not exist (i.e. there are id_category values in the categories_properties table that are not defined in the categories table), then you should keep the join.
The join:
join categories on
categories.id_category = things_categories.id_category
is very necessary, since the categories table is used in the next join:
join categories_properties on
categories_properties.id_category = categories.id_category
So it's definitely required if it's not already enforced elsewhere, since SQL requires you to establish the links it needs to index and join one table to the next.
What is very painful, however, is the select *.
You don't need all that info, since * will bring back all data from all tables.
Perhaps you could specify what you need from each table or, at worst, use things.* to specify all columns of a specific table.
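For instance, the first query above could name just what it uses (the nm_thing column is hypothetical, since the question doesn't list the tables' columns):

select
    things.id_thing,
    things.nm_thing,  -- hypothetical column
    categories_properties.bo_default
from things
join things_categories on
    things_categories.id_thing = things.id_thing
join categories on
    categories.id_category = things_categories.id_category
join categories_properties on
    categories_properties.id_category = categories.id_category
where
    categories_properties.bo_default = 1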
If you do not need a join, do not use it; you are taking a totally unneeded performance hit. Don't force the database to do work it doesn't need to do because you think it looks more complete; you should consider performance ahead of readability in a query. After all, once you start writing performant SQL code, it will become more readable to you. However, make sure you actually don't need the join before eliminating it, by checking that both versions of the query return the same result set.

How to do a SQL query using columns from a related table?

I've got three related SQL tables, simplified they look like this:
ShopTable
[ShopID]
ShelfTable
[ShelfID]
[ShopID]
InventoryTable
[ShelfID]
[Value]
[ShopID] and [ShelfID] are relations. Now what I want to do is get the SUM of [Value] for one [ShopID], but this obviously won't work since [ShopID] ain't part of InventoryTable:
SELECT SUM([Value]) WHERE [ShopID] = '1'
How do I have to write the query to filter the InventoryTable using the ShopID?
SELECT SUM(i.value)
FROM shelfTable s
JOIN inventoryTable i
ON i.shelfId = s.shelfId
WHERE s.shopId = 1
This is a fundamental question about relations between tables, so I'll provide some detail, hoping that you can use some of these ideas when writing SQL queries in the future.
Let's start with one basic thing first. [ShopID] could refer to two different but related columns, one in [ShopTable] and one in [ShelfTable]. The same things applies to [ShelfID]. It's useful to always specify the table.
You describe [ShopID] and [ShelfID] as "relations." As Damien_The_Unbeliever has commented, those columns are, in fact, two pairs of primary and foreign keys. That is, [ShelfTable].[ShelfID] identifies a "shelf" record, and [InventoryTable].[ShelfID] relates an "inventory item" (whatever that is) to a "shelf." (It's not always possible to interpret rows in a database this naively, but I'm willing to guess I'm not too far off from reality.)
Likewise, each "shelf" belongs to one "shop," and [ShelfTable].[ShopID] refers to that specific "shop." Notice that because we have the value of [ShopID] already (I'll call it "#MyShopID"), we don't even need the [ShopTable] here. We can just use [ShelfTable].[ShopID] to filter for the "shelves" we're interested in.
You're asking to get the sum total of [InventoryTable].[Value] for one [ShopID] value, but [ShopID] doesn't show up in [InventoryTable]. That's where your (inner) join comes into play. You know that you'll be adding up values from [InventoryTable], but you've got to specify the particular "shop." You specify #MyShopID for [ShelfTable].[ShopID], which will do your filtering in [InventoryTable] for you.
One final thing before composing the query. I'm assuming that you haven't oversimplified your tables too much, and that [Value] is the total value of each "inventory item," and not just a unit value. If it wasn't, we'd have to multiply values by quantities, etc., but I'll let you check your own work here.
So, here's what we do:
We select FROM the [InventoryTable]
but we INNER JOIN to the [ShelfTable] on [ShelfID] from both tables
and we only want "shelves" from one "shop," i.e. WHERE [ShelfTable].[ShopID] = #MyShopID
and then we SELECT the SUM([InventoryTable].[Value])
and we're done. In SQL, let's remove the brackets, provide some table aliases, and we'll get a query that looks like this:
SELECT SUM(inv.Value)
FROM InventoryTable AS inv
INNER JOIN ShelfTable AS shf ON shf.ShelfID = inv.ShelfID
WHERE shf.ShopID = #MyShopID
;
Here are a few take-away points to consider. Notice we handled the FROM clause first. You'll always want to do that.
You'll also want a "driving table" to start with, in this case, [InventoryTable]. The other tables in your join add extra information and provide you a means to filter, but don't otherwise interfere with your summing up. More complex queries don't offer such an obvious luxury, but we're not getting too fancy here.
You'll also note, just briefly, that because [ShelfID] is a primary key in [ShelfTable], those [ShelfID]'s are unique values in [ShelfTable], and so each "inventory" thing belongs to a single "shelf." So the join won't cause us to double-count values. That's a good thing to remember when you're not dealing with primary and foreign keys, like we're doing here.
Hope that helps. And I hope I didn't come across as too pedantic.

T-SQL Check Why Record Is Missing

If you've got a very complicated SELECT statement and some records aren't included because of a join, what is the easiest way to debug this and find the reasons why?
Change the JOINS / INNER JOINS to OUTER JOINS and look for NULLs where they shouldn't be.
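For instance, here's a sketch with hypothetical Orders and Customers tables; once the join is outer, an IS NULL filter surfaces exactly the rows that the inner join was dropping:

SELECT o.OrderID, o.CustomerID
FROM Orders o
LEFT JOIN Customers c ON c.CustomerID = o.CustomerID
WHERE c.CustomerID IS NULL  -- the orders that failed to match a customer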
You could include your ON logic in the WHERE clause, phrased like this:
WHERE 1=1
AND...
AND...
and just comment out terms one at a time until you isolate the unexpected behaviour.
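A sketch of the idea with hypothetical tables (note this rewrite only preserves semantics for INNER JOINs, not outer ones):

SELECT o.OrderID, c.CustomerName
FROM Orders o
CROSS JOIN Customers c      -- inner join conditions moved into WHERE
WHERE 1=1
AND o.CustomerID = c.CustomerID
--AND o.Status = 'Shipped'  -- toggle terms like this one until the rows reappear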
I don't know if this will help you, but when I find myself with a complex SELECT that I'm having a hard time maintaining or debugging, I'll break it up into separate common table expressions (CTEs). I've found this makes many of my queries much easier to understand and maintain.
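Here's a sketch of that decomposition, again with hypothetical tables; each stage can be SELECTed on its own while debugging:

;WITH ShippedOrders AS (
    SELECT OrderID, CustomerID
    FROM Orders
    WHERE Status = 'Shipped'
),
OrderCustomers AS (
    SELECT so.OrderID, c.CustomerName
    FROM ShippedOrders so
    JOIN Customers c ON c.CustomerID = so.CustomerID
)
SELECT * FROM OrderCustomers;
--SELECT * FROM ShippedOrders;  -- swap in to inspect a single stage in isolation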
Binary Search stylee:
If you have 10 joins, comment out the last 5.
Still have the problem? Comment out the last 2 or 3 joins that are still uncommented.
Still have the problem? Comment out the last 1 or 2 joins that are still uncommented.
Do this until the query works; the problem will then lie in the last joins you commented out.
Yes you could do them one at a time but this is usually quicker.
Obviously you will also have to comment out any columns in the select list that come from the commented-out joins, but I normally just put /* */ around all the columns and put a * there instead.
Just look at the number of results returned.