tsql Table-Valued Function Best Practices - tsql

I work for a public school system and have to create behavior reports for various things. I have about a dozen reports where the data is generated via stored procedures and pretty much retrieve the same base data. I'd like to create a Table-Valued Function so that each report is pulling the same base dataset to work with.
I have a student table that has columns such as schoolYear, Grade, schoolID, etc. The key is enrollmentID. I'm trying to determine which is more efficient, to return all the fields I'll be working with from the student table (schoolYear, grade, etc) or just return the key (enrollmentID) and then join it back with the student table in the query that is calling the function.
On one hand, it seems redundant to join the student table in both the function and the stored procedure that is calling the function. On the other hand, thousands of records are potentially returning and it seems like in the past I've noticed significant lag if I'm trying to return a lot of data from a function.
Is there a best-practice on how to best use table-valued functions? As always your help is greatly appreciated!

Related

Design question for a table with too many joins OR polymorphic relations in Postgres 11.7

I've been given a table that I'm not sure how to design. I'm hoping for some design suggestions, or pointers in the right direction. The table is called edge and is meant to store some event traces, and IDs that link out to a host of possible lookup tables. Leaving out everything but IDs, here's what the table contains, all UUIDs:
ID
InvID
OrgID
FacilityID
FromAssemblyID
FromAssociatedTo
FromAssociatedToID
FromClinicID
FromFacilityDepartmentID
FromFacilityID
FromFacilityLocationID
FromScanAtFacilityID
FromScanID
FromSCaseID
FromSterilizerLoadID
FromWasherLoadID
FromWebUserID
ToAssemblyID
ToAssociatedTo
ToAssociatedToID
ToClinicID
ToFacilityDepartmentID
ToFacilityID
ToFacilityLocationID
ToNodeDTS
ToScanAtFacilityID
ToScanID
ToSCaseID
ToSterilizerLoadID
ToUserName
ToWasherLoadID
ToWebUserID
That's an overwhelming number of IDs to possibly join on. I remember reading that the Postgres planner kind of gives up when you've got a dozen+ joins. The idea being that there are so many permutations to explore, that the planning time could quickly overwhelm the query time. If you boil it down, the "from" and "to" links are only ever going to have one key value across all of those fields. So, implemented as a polymorphic/promiscuous relations, something like this:
ID
InvID
OrgID
FacilityID
FromID
FromType
ToID
ToType
ToWebUserID
This table is going to be ginormous, so speed is/will be a consideration.
I encouraged the author not to use a polymorphic design, although the appeal is obvious. (I like Karwin's SQL Antipatterns book.) But now, confronted with nearly three dozen IDs, I'm a bit stumped.
Is there a common solution to this kind of problem? Namely, where you've got a central table like this with connections to a wide variety of possible tables? I don't have a Data Warehousing background, but this looks somewhat like that. (The author of this table has read Kimball's books, but not done any Data Warehouse implementations either.)
Important: We're using JOIN to do lookups on related values that might change, we're not using it to change the size of the result set. Just pretend it would always be LEFT JOIN.
With that in mind, what I've thought of is to skip joining on the From and To IDs, and instead use custom function calls to look up required values from the related tables. like (pseudo-code)
GetUserName(uuid) : citext
...and os on for other values of interest in this and other tables...
The function would return '' when the UUID is 0000etc.
I appreciate that this isn't the crispest question in the history of SO, and I what I'm hoping for pointers in a fruitful direction.
This smacks of “premature optimization” (which is a source of evil) based on something that you “remember reading”, so maybe some enlightenment about join optimization will help.
One rule of thumb that I follow in questions like this is to model things so that your queries become simple and natural. Experience shows that that often leads to good performance.
I assume that the table you show is the fact table of a star schema, and the foreign keys point to the many dimension tables, so that your query will look like
SELECT ...
FROM fact
JOIN dim1 ON fact.dim1_id = dim1.id
JOIN dim2 ON fact.dim3_id = dim2.id
JOIN dim3 ON fact.dim3_id = dim3.id
...
WHERE dim1.col1 = ...
AND dim2.col2 BETWEEN ... AND ...
AND dim3.col3 < ...
...
Now PostgreSQL will by default only consider all join permutations of the first eight tables (join_collapse_limit), and the rest of the tables are just joined in the order in which they appear in the query.
Moreover, if the number of tables reaches the threshold of 12 (geqo_threshold), the genetic query optimizer takes over, a component that simulates evolution by mutation and survival of the fittest with randomly chosen execution plans (really!) and consequently doesn't always come up with the same execution plan for the same query.
So my advice would be to write the queries in a way that the first seven dimension tables are the ones with the biggest chance of reducing the number of result rows most significantly (based on the WHERE conditions). You can also increase join_collapse_limit, because if your queries take a long time to run anyway, you can easily afford the planner to spend more time thinking about the best plan.
Then you'd set geqo = off to disable the genetic query optimizer.
If you design your queries according to these principles, you should be able to get good execution plans without messing up the data model.

MATLAB- Joining tables w/ overlapping data using key variable WHERE neither table contains all data points from the other one

I am working on combining 2 tables with different types of patient information using the PID (Patient Identity) feature present in both tables. Usually the function "join" (https://www.mathworks.com/help/matlab/ref/table.join.html) does the trick when one of the tables have information on all the patients from the other one. But in my case, both tables have certain values of PID (or information for new patients) that isn't present in the other one. How do I create a new table for using patient info from both tables that only contains info on the patients present in both tables?
I could probably write some long, clunky code to do this manually, but I was wondering if there's a function (or a few functions) that can do the task more efficiently. Thank you
The solution is to use either innerjoin or outerjoin.

Tableau Extract API with multiple tables in a database

I am currently experimenting with Tableau Extract API to generate some TDE from the tables I have in a PostgreSQL database. I was able to write a code to generate the TDE from single table, but I would like to do this for multiple joined tables. To be more specific, if I have two tables that are inner joined by some field, how would I generate the TDE for this?
I can see that if I am working with small number of tables, I could use a SQL query with JOIN clauses to create a one gigantic table, and generate the TDE from that table.
>> SELECT * FROM table_1 INNER JOIN table_2
INTO new_table_1
ON table_1.id_1 = table_2.id_2;
>> SELECT * FROM new_table_1 INNER JOIN TABLE_3
INTO new_table_2
ON new_table_1.id_1 = table_3.id_3
and then generate the TDE from new_table_2.
However, I have some tables that have over 40 different fields, so this could get messy.
Is this even a possibility with current version of the API?
You can read from as many tables or other sources as you want. Or use complex query with lots of joins, or create a view and read from that. Usually, creating a view is helpful when you have a complex query joining many tables.
The data extract API is totally agnostic about how or where you get the data to feed it -- the whole point is to allow you to grab data from unusual sources that don't have pre-built drivers for Tableau.
Since Tableau has a Postgres driver and can read from it directly, you don't need to write a program with the data extract API at all. You can define your extract with Tableau Desktop. If you need to schedule automated refreshes of the extract, you can use Tableau Server or its tabcmd command.
Many thanks for your replies. I am aware that I could use Tableau Desktop to define my extract. In fact, I have done this many times before. I am just trying to create the extracts using the API, because I need to create some calculated fields, which is near impossible to create using the Tableau Desktop.
At this point, I am hesitant to use JOINs in the SQL query because the resulting table would look too complicated to comprehend (some of these tables also have same field names).
When you say that I could read from multiple tables or sources, does that mean with the Tableau Extract API? At this point, I cannot find anywhere in this API that accommodates multiple sources. For example, I know that when I use multiple tables in the Tableau Desktop, there are icons on the left hand side that tells me that the extract is composed of multiple tables. This just doesn't seem to be happening with the API, which leaves me stranded. Anyways, thank you again for your replies.
Going back to the topic, this is something that I tried few days ago on my python code
try:
tdefile= tde.Extract("extract.tde")
except:
os.remove("extract.tde")
tdefile = tde.Extract("extract.tde")
tableDef = tde.TableDefinition()
# Read each column in table and set the column data types using tableDef.addColumn
# Some code goes here...
for eachTable in tableNames:
tableAdd = tdeFile.addTable(eachTable, tableDef)
# Use SQL query to retrieve bunch_of_rows from eachTable
for some_row in bunch_of_rows:
# Read each row in table, and set the values in each column position of each row
# Some code goes here...
tableAdd.insert(some_row)
some_row.close()
tdefile.close()
When I execute this code, I get the error that eachTable has to be called "Extract".
Of course, this code has its flaws, as there is no where in this code that tells how each table are being joined.
So I am little thrown off here, because it doesn't seem like I can use multiple tables unless I use JOINs to generate one table that contains everything.

Changed property value when selecting in EF4

I need to change the value of a property when I query the database using EF4. I have a company code that gets returned and I need to translate it to another company code, if needed. So, there is a stored procedure that is used to do this currently. Here's the old select statement.
SELECT companyName, TranslateCompanyCode(companyCode) as newCompanyCode FROM companyTable where companyCode = 'AA';
TranslateCompanyCode is the stored proc that does the translation. I'd like to do this in my new code when needed. I think I might need to use a Model-Defined Function. Anyone know how I can do this?
For your scenario, I would use a JOIN. Model-defined functions are cool when you need to perform a quick function on a value (particularly without an additional query). From a performance standpoint, a JOIN will be faster and more efficient than trying to put the sub-query in a model-defined function - particularly if you are selecting more than 1 row at a time.
However, if you do still want to use Model defined functions, then this example should point you in the right direction as to how to run a query within the function. This implementation will also be more complex than just using a join but is an alternative.

TSql Lookup function

I have a bunch of dimension tables that have unique ID and Name fields. I need a T-SQL function that returns an ID when passed a table name and a value for the name field.
I'm guessing the function would build a little query then execute it? Performance isn't an issue since this is a one time ETL thing.
Sounds like you want a scalar user defined function
There's a lot of other ways to do it though, and udf's certainly can have some perfomance issues.