Design question for a table with too many joins OR polymorphic relations in Postgres 11.7 - postgresql

I've been given a table that I'm not sure how to design. I'm hoping for some design suggestions, or pointers in the right direction. The table is called edge and is meant to store some event traces, and IDs that link out to a host of possible lookup tables. Leaving out everything but IDs, here's what the table contains, all UUIDs:
ID
InvID
OrgID
FacilityID
FromAssemblyID
FromAssociatedTo
FromAssociatedToID
FromClinicID
FromFacilityDepartmentID
FromFacilityID
FromFacilityLocationID
FromScanAtFacilityID
FromScanID
FromSCaseID
FromSterilizerLoadID
FromWasherLoadID
FromWebUserID
ToAssemblyID
ToAssociatedTo
ToAssociatedToID
ToClinicID
ToFacilityDepartmentID
ToFacilityID
ToFacilityLocationID
ToNodeDTS
ToScanAtFacilityID
ToScanID
ToSCaseID
ToSterilizerLoadID
ToUserName
ToWasherLoadID
ToWebUserID
That's an overwhelming number of IDs to possibly join on. I remember reading that the Postgres planner kind of gives up when you've got a dozen+ joins. The idea being that there are so many permutations to explore, that the planning time could quickly overwhelm the query time. If you boil it down, the "from" and "to" links are only ever going to have one key value across all of those fields. So, implemented as a polymorphic/promiscuous relations, something like this:
ID
InvID
OrgID
FacilityID
FromID
FromType
ToID
ToType
ToWebUserID
This table is going to be ginormous, so speed is/will be a consideration.
I encouraged the author not to use a polymorphic design, although the appeal is obvious. (I like Karwin's SQL Antipatterns book.) But now, confronted with nearly three dozen IDs, I'm a bit stumped.
Is there a common solution to this kind of problem? Namely, where you've got a central table like this with connections to a wide variety of possible tables? I don't have a Data Warehousing background, but this looks somewhat like that. (The author of this table has read Kimball's books, but not done any Data Warehouse implementations either.)
Important: We're using JOIN to do lookups on related values that might change, we're not using it to change the size of the result set. Just pretend it would always be LEFT JOIN.
With that in mind, what I've thought of is to skip joining on the From and To IDs, and instead use custom function calls to look up required values from the related tables. like (pseudo-code)
GetUserName(uuid) : citext
...and os on for other values of interest in this and other tables...
The function would return '' when the UUID is 0000etc.
I appreciate that this isn't the crispest question in the history of SO, and I what I'm hoping for pointers in a fruitful direction.

This smacks of “premature optimization” (which is a source of evil) based on something that you “remember reading”, so maybe some enlightenment about join optimization will help.
One rule of thumb that I follow in questions like this is to model things so that your queries become simple and natural. Experience shows that that often leads to good performance.
I assume that the table you show is the fact table of a star schema, and the foreign keys point to the many dimension tables, so that your query will look like
SELECT ...
FROM fact
JOIN dim1 ON fact.dim1_id = dim1.id
JOIN dim2 ON fact.dim3_id = dim2.id
JOIN dim3 ON fact.dim3_id = dim3.id
...
WHERE dim1.col1 = ...
AND dim2.col2 BETWEEN ... AND ...
AND dim3.col3 < ...
...
Now PostgreSQL will by default only consider all join permutations of the first eight tables (join_collapse_limit), and the rest of the tables are just joined in the order in which they appear in the query.
Moreover, if the number of tables reaches the threshold of 12 (geqo_threshold), the genetic query optimizer takes over, a component that simulates evolution by mutation and survival of the fittest with randomly chosen execution plans (really!) and consequently doesn't always come up with the same execution plan for the same query.
So my advice would be to write the queries in a way that the first seven dimension tables are the ones with the biggest chance of reducing the number of result rows most significantly (based on the WHERE conditions). You can also increase join_collapse_limit, because if your queries take a long time to run anyway, you can easily afford the planner to spend more time thinking about the best plan.
Then you'd set geqo = off to disable the genetic query optimizer.
If you design your queries according to these principles, you should be able to get good execution plans without messing up the data model.

Related

Active Record efficient querying on multiple different tables

Let me give a summary of what I've been attempting to do and the efficiency issues I've been running into:
Essentially I want my users to be able to select parameters to filter data from my database, then I want to pass relevant data which passes those filters from the controller.
However, these filters query on data from multiple different tables (that is, about 5-6 different tables), some of which are quite large (as in 100k+ rows). These tables are all related to what I want to show, e.g. Here is a bond that meets so and so criteria, which is issued by so and so issuer, which must meet these criteria, and so on.
From an end result, I only really need about 100 rows after querying based on the parameters given by the user, but it feels like I need to look at everything in every table because I dont know how strict the filters will be beforehand. e.g. With a starting universe of 100k sets of data, passing filter f1,f2 of Table 1 might leave 90k, but after passing through filter f3 of table 2, f4,f5,f6 of table 3, and so ..., we might end up with 100 or less sets of data that pass these parameters because the last filters checked might be quite strict.
How can I go about querying through these multiple different tables efficiently?
Doing a join between them seems like it'd yield some time complexity of |T_1||T_2||T_3||T_4||T_5||T_6| where T_i is the "size" of table i.
On the other hand, just looking through the other tables based off the ids of the ones that pass the previous filter (as in, id 5,7,8 pass filters in T_1, which of those ids then pass filters in T_2, then which of those pass filters in T_3 and so on) looks like it might(?) have time complexity of |T_1| + |T_2| + ... + |T_6|.
I'm relatively new to Ruby on Rails, so im not entirely sure all of the tools at my disposal that could help with optimizing this, but at the same time I'm not entirely sure how to best approach this algorithmically.

Why to create empty (no rows, no columns) table in PostgreSQL

In answer to this question I've learned that you can create empty table in PostgreSQL.
create table t();
Is there any real use case for this? Why would you create empty table? Because you don't know what columns it will have?
These are the things from my point of view that a column less table is good for. They probably fall more into the warm and fuzzy category.
1.
One practical use of creating a table before you add any user
defined columns to it is that it allows you to iterate fast when
creating a new system or just doing rapid dev iterations in general.
2.
Kind of more of 1, but lets you stub out tables that your app logic or procedure can make reference too, even if the columns have
yet to
be put in place.
3.
I could see it coming in handing in a case where your at a big company with lots of developers. Maybe you want to reserve a name
months in advance before your work is complete. Just add the new
column-less table to the build. Of course they could still high
jack it, but you may be able to win the argument that you had it in
use well before they came along with their other plans. Kind of
fringe, but a valid benefit.
All of these are handy and I miss them when I'm not working in PostgreSQL.
I don't know the precise reason for its inclusion in PostgreSQL, but a zero-column table - or rather a zero-attribute relation - plays a role in the theory of relational algebra, on which SQL is (broadly) based.
Specifically, a zero-attribute relation with no tuples (in SQL terms, a table with no columns and no rows) is the relational equivalent of zero or false, while a relation with no attributes but one tuple (SQL: no columns, but one row, which isn't possible in PostgreSQL as far as I know) is true or one. Hugh Darwen, an outspoken advocate of relational theory and critic of SQL, dubbed these "Table Dum" and "Table Dee", respectively.
In normal algebra x + 0 == x and x * 0 == 0, whereas x * 1 == x; the idea is that in relational algebra, Table Dum and Table Dee can be used as similar primitives for joins, unions, etc.
PostgreSQL internally refers to tables (as well as views and sequences) as "relations", so although it is geared around implementing SQL, which isn't defined by this kind of pure relation algebra, there may be elements of that in its design or history.
It is not empty table - only empty result. PostgreSQL rows contains some invisible (in default) columns. I am not sure, but it can be artifact from dark age, when Postgres was Objected Relational database - and PG supported language POSTQUEL. This empty table can work as abstract ancestor in class hierarchy.
List of system columns
I don't think mine is the intended usage however recently I've used an empty table as a lock for a view which I create and change dynamically with EXECUTE. The function which creates/replace the view has ACCESS EXCLUSIVE on the empty table and the other functions which uses the view has ACCESS.

tsql join best practices advice

First of all, English it's not my first language, feel free to edit my question and I'm sorry for any mistakes that can offend you or not being so clear exposing the problem.
I have a few sql queries with lots of joins, these joins are based on clustered index (no worries about that). Some of the joins are used only to respect normalization and because is intuitive to maintenance, but sometimes it's possible to skip some of then. It's not clear to me what to do about these joins in terms of best practices.
Edit:
A simple example:
select *
from things
join things_categories on
things_categories.id_thing = things.id_thing
join categories on
categories.id_category = things_categories.id_category
join categories_properties on
categories_properties.id_category = categories.id_category
where
categories_properties.bo_default = 1
But it's possible to do:
select *
from things
join things_categories on
things_categories.id_thing = things.id_thing
join categories_properties on
categories_properties.id_category = things_categories.id_category
where
categories_properties.bo_default = 1
The second join it's not necessary (I do have integrity at database level), it's there only because makes the code more intuitive and respect the database normalization. I'm not sure if I should follow the smallest possible and efficient path or leave unnecessary joins to respect normalization and make the code more intuitive.
Any tips?
All the best.
It deppends, wheter you've or not integrity already.
In one hand, if the categories_properties table has a foreign key in the id_category column, then the integrity exists and you don't need to make the join with the categories table.
On the other hand, if the integrity might not exist (i.e.: there are id_categories in categories_properties table that are not defined in categories table), then you should make the join.
The join:
join categories on
categories.id_category = things_categories.id_category
is very necessary, since the categories table is used in the next join:
join categories_properties on
categories_properties.id_category = categories.id_category
So it's definitely required, if it's not already defined, as SQL requires for you to establish the links it needs to index and join one to the next.
What is however very painful, is the select *.
You don't need all that info, since * will bring all data from all tables.
Perhaps you could specify what you need from each table or, at worst, use things.* to specify all columns of a specific table.
If you do not need a join do not use it. You are taking a totally unneeded performance hit. Don't force the database to do work it doesn't need to do because you think it looks more comlete, you should consider performance ahead of readability in a query. After all once you start writing performant SQl code, it will become more readable to you. However, make sure you actually don't need it before eliminating it by making sure both versions of the query return the same result set.

How to do a SQL query using columns from a related table?

I've got three related SQL tables, simplified they look like this:
ShopTable
[ShopID]
ShelfTable
[ShelfID]
[ShopID]
InventoryTable
[ShelfID]
[Value]
[ShopID] and [ShelfID] are relations. Now what I want to do is get the SUM of [Value] for one [ShopID], but this obviously won't work since [ShopID] ain't part of InventoryTable:
SELECT SUM([Value]) WHERE [ShopID] = '1'
How do I have to write the query to filter the InventoryTable using the ShopID?
SELECT SUM(i.value)
FROM shelfTable s
JOIN inventoryTable i
ON i.shelfId = s.shelfId
WHERE s.shopId = 1
This is a fundamental question about relations between tables, so I'll provide some detail, hoping that you can use some of these ideas when writing SQL queries in the future.
Let's start with one basic thing first. [ShopID] could refer to two different but related columns, one in [ShopTable] and one in [ShelfTable]. The same things applies to [ShelfID]. It's useful to always specify the table.
You describe [ShopID] and [ShelfID] as "relations." As Damien_The_Unbeliever has commented, those columns are, in fact, two pairs of primary and foreign keys. That is, [ShelfTable].[ShelfID] identifies a "shelf" record, and [InventoryTable].[ShelfID] relates an "inventory item" (whatever that is) to a "shelf." (It's not always possible to interpret rows in a database this naively, but I'm willing to guess I'm not too far off from reality.)
Likewise, each "shelf" belongs to one "shop," and [ShelfTable].[ShopID] refers to that specific "shop." Notice that because we have the value of [ShopID] already (I'll call it "#MyShopID"), we don't even need the [ShopTable] here. We can just use [ShelfTable].[ShopID] to filter for the "shelves" we're interested in.
You're asking to get the sum total of [InventoryTable].[Value] for one [ShopID] value, but [ShopID] doesn't show up in [InventoryTable]. That's where your (inner) join comes into play. You know that you'll be adding up values from [InventoryTable], but you've got to specify the particular "shop." You specify #MyShopID for [ShelfTable].[ShelfID], which will do your filtering in [InventoryTable] for you.
One final thing before composing the query. I'm assuming that you haven't oversimplified your tables too much, and that [Value] is the total value of each "inventory item," and not just a unit value. If it wasn't, we'd have to multiply values by quantities, etc., but I'll let you check your own work here.
So, here's what we do:
We select FROM the [InventoryTable]
but we INNER JOIN to the [ShelfTable] on [ShelfID] from both tables
and we only want "shelves" from one "shop," i.e. WHERE [ShelfTable].[ShopID] = #MyShopID
and then we SELECT the SUM([InventoryTable].[Value])
and we're done. In SQL, let's remove the brackets, provide some table aliases, and we'll get a query that looks like this:
SELECT SUM(inv.Value)
FROM InventoryTable AS inv
INNER JOIN ShelfTable AS shf ON shf.ShelfID = inv.ShelfID
WHERE shf.ShopID = #MyShopID
;
Here are a few take-away points to consider. Notice we handled the FROM clause first. You'll always want to do that.
You'll also want a "driving table" to start with, in this case, [InventoryTable]. The other tables in your join add extra information and provide you a means to filter, but don't otherwise interfere with your summing up. More complex queries don't offer such an obvious luxury, but we're not getting too fancy here.
You'll also note, just briefly, that because [ShelfID] is a primary key in [ShelfTable], those [ShelfID]'s are unique values in [ShelfTable], and so each "inventory" thing belongs to a single "shelf." So the join won't cause us to double-count values. That's a good thing to remember when you're not dealing with primary and foreign keys, like we're doing here.
Hope that helps. And I hope I didn't come across as too pedantic.

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on first letter of certain nvarchar column and not the usual one, that usually pages on number of results.
And I'm not faced with a challenge whether to filter results using LIKE operator or equality (=) operator.
select *
from table
where name like #firstletter + '%'
vs.
select *
from table
where left(name, 1) = #firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = #firstletter
That's because most database are read far more often than written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the left 5. As an aside my overall query does hit some indexes.
I would always suggest to use like operator when the search column contains index. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)= 'ABA' OR ... up to 9 OR clauses. My count displays 7301477 records with 4 secs in left and 1 second in like i.e where column_name like 'AAA%' OR Column_Name like 'ABA%' or ... up to 9 like clauses.
Calling a function in where clause is not a best practice. Refer http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.startsWith(...) and you'll get just a LIKE function in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework (non core) EntityFunctions so I'm not sure how to do it for EF6.