sybase - fails to use index unless string is hard-coded - tsql

I'm using Sybase 12.5.3 (ASE); I'm new to Sybase though I've worked with MSSQL pretty extensively. I'm running into a scenario where a stored procedure is really very slow. I've traced the issue to a single SELECT stmt for a relatively large table. Modifying that statement dramatically improves the performance of the procedure (and reverting it drastically slows it down; i.e., the SELECT stmt is definitely the culprit).
-- Sybase optimizes and uses multi-column index... fast!
SELECT ID,status,dateTime
FROM myTable
WHERE status in ('NEW','SENT')
ORDER BY ID
-- Sybase does not use index and does very slow table scan
SELECT ID,status,dateTime
FROM myTable
WHERE status in (select status from allowableStatusValues)
ORDER BY ID
The code above is an adapted/simplified version of the actual code. Note that I've already tried recompiling the procedure, updating statistics, etc.
I have no idea why Sybase ASE would choose an index only when strings are hard-coded and choose a table scan when choosing from another table. Someone please give me a clue, and thank you in advance.

1. The issue here is poor coding. In your release, poor code and poor table design are the main reasons (98% of cases) the optimiser makes incorrect decisions; the two go hand-in-hand, and I have not figured out the proportion of each. Both:
WHERE status IN ('NEW','SENT')
and
WHERE status IN (SELECT status FROM allowableStatusValues)
are substandard, because in both cases they cause ASE to create a worktable for the contents between the brackets, which can easily be avoided (and all consequential issues avoided with it). There is no possibility of statistics on a worktable, so the statistics on either t.status or s.status are effectively missing (AdamH is correct on that point), and the optimiser correctly chooses a table scan.
Subqueries have their place, but never as a substitute for a pure join (where the tables are related). The corrections are:
WHERE status = "NEW" OR status = "SENT"
and
FROM myTable t,
allowableStatusValues s
WHERE t.status = s.status
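Put together with the SELECT list from the question, the join form would look something like this (a sketch; the column list is taken from the question):
SELECT t.ID, t.status, t.dateTime
FROM myTable t,
     allowableStatusValues s
WHERE t.status = s.status
ORDER BY t.ID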
2. The statement:
"Now you don't have to add an index to get statistics on a column, but it's probably the best way."
is incorrect. Never create indices that you will not use. If you want statistics updated on a column, simply run:
UPDATE STATISTICS myTable (status)
3.It is important to ensure that you have current statistics on (a) all indexed columns and (b) all join columns.
4. Yes, there is no substitute for SHOWPLAN on every code segment that is intended for release, doubly so for any code with questionable performance. You can also SET NOEXEC ON to avoid execution, e.g. for large result sets.
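A sketch of such a session (standard ASE syntax, using the subquery form from the question; note that showplan must be set before noexec, and noexec switched off first afterwards):
SET SHOWPLAN ON
GO
SET NOEXEC ON
GO
-- the statement is compiled and its plan displayed, but not executed
SELECT ID, status, dateTime
FROM myTable
WHERE status IN (SELECT status FROM allowableStatusValues)
ORDER BY ID
GO
SET NOEXEC OFF
GO
SET SHOWPLAN OFF
GO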

An index hint will work around it, but is probably not the solution.
Firstly, I'd like to know if there is an index on allowableStatusValues.status; if there is, then Sybase will have stats on it and will have a good idea of the number of values in there.
If not, then the optimiser probably won't have a good idea how many different values status may take. It then has to assume that you're going to extract almost all of the rows from myTable, and the best way of doing that is a table scan (if there is no covering index).
Now you don't have to add an index to get statistics on a column, but it's probably the best way.
If you do have an index on allowableStatusValues.status, then I'd wonder how good your stats are. Get yourself a copy of sp__optdiag. You probably also need to tune the values of "histogram tuning factor" and "number of histogram steps"; increasing these slightly from the defaults will give you more detailed statistics, which always helps the optimiser.
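For instance, a sketch of that tuning (the numbers are illustrative, not recommendations, and assume your ASE version exposes these sp_configure parameters):
-- inspect the current server-wide settings
EXEC sp_configure "number of histogram steps"
EXEC sp_configure "histogram tuning factor"
-- raise the step count slightly, then refresh the column statistics with more steps
EXEC sp_configure "number of histogram steps", 40
UPDATE STATISTICS allowableStatusValues (status) USING 60 VALUES
UPDATE STATISTICS myTable (status) USING 60 VALUES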

Does it still do a table scan if you replace the subquery with a join:
SELECT m.ID,m.status,m.dateTime
FROM myTable m
JOIN allowableStatusValues a on m.status = a.status
ORDER BY ID

Rather than relying on experimental observations of how long a query takes to run, I would highly recommend getting Sybase to show you the execution plans for each query, for example:
SET showplan ON
GO
-- query/procedure call goes here
SELECT id, status, datetime
FROM myTable
WHERE status IN('NEW','SENT')
ORDER BY id
GO
SET showplan OFF
GO
With SET showplan ON, Sybase generates execution plans for every statement it executes. These can be invaluable in helping to identify where queries are not making use of appropriate indexes. For stored procedures in Sybase, the execution plan for the entire procedure is generated when the stored procedure is first executed after being compiled.
If you post the plans for each of your queries we might be able to shed more light on the problem.

Amazingly, using an index hint resolves the issue (see the (index myIndexName) hint in the rewritten/simplified code below):
-- using INDEX HINT
SELECT ID,status,dateTime
FROM myTable (index myIndexName)
WHERE status in (select status from allowableStatusValues)
ORDER BY ID
Weird that I have to use this technique to avoid a table scan, but there ya go.

Garrett, by showing only the simplified code, you have likely stripped out exactly the information that would illuminate the source of the problem.
My first guess would be a type mismatch between allowableStatusValues.status and myTable.status. However, that is not the only possibility. As ninesided stated, the complete query plans (using showplan and fmtonly flags), as well as the actual table definitions and stored procedure source, are much more likely to produce a useful answer.
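As a quick check on the type-mismatch guess, comparing the two column definitions is a start (sp_help is the standard ASE system procedure; table names are taken from the question):
EXEC sp_help myTable
EXEC sp_help allowableStatusValues
-- compare the datatype, length, and nullability reported for the two status columns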

Related

PostgreSQL query - show which rows are locked

I would like to query data from a table, and if a row is locked, show it in a different color. Is this possible using PostgreSQL's locking for UPDATE?
e.g.
select
*,
(select from pg_x -- link row somehow )
from table
thank you
There is no good way to do that. The row locks are stored in the row (system column xmax), but this attribute serves other purposes too, and the flags that determine if it is indeed a row lock or perhaps a rolled back update are not exposed via SQL.
There are only unpleasant alternatives:
Use the pageinspect contrib module to examine those flags. That would be a second scan of the table, and such a query doesn't respect MVCC visibility. (A sketch appears at the end of this answer.)
Run a second query:
SELECT * FROM atable
FOR UPDATE SKIP LOCKED;
That would lock all rows in the table and be very bad for concurrency.
Besides, that information would be pretty useless for the user. In a well-written application, row locks are only held for split seconds, so the information would be outdated by the time it reaches the user.
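For reference, a minimal pageinspect sketch (assuming a table named atable and inspecting only block 0; interpreting the t_infomask bits requires the flag definitions from PostgreSQL's htup_details.h, e.g. HEAP_XMAX_LOCK_ONLY = 0x0080):
-- requires: CREATE EXTENSION pageinspect;
SELECT lp, t_xmax, t_infomask
FROM heap_page_items(get_raw_page('atable', 0));
-- a non-zero t_xmax whose infomask has the lock-only bit set indicates a row lock
-- rather than a deleting or updating transaction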

Execution Plan on a View looking at Partitioned Tables

I currently have tables that are partitioned out by year & month for our sales transactions. For example, we have sales tables that would look something like this:
factdailysales_201501
factdailysales_201502
factdailysales_201503 etc ...
Generally, I've always performed dynamic SQL to capture a Start Date and End Date, find out which partitions those fall into, and then loop through each of those partitions ... but it's starting to become such a hassle, and I've learned that this is probably not the best way to do it in terms of maintenance, troubleshooting, and performance.
I decided to build a view that would UNION ALL my sales partitions together. However, I don't want selecting from the view to scan all of the partitions on execution, as that would defeat the whole purpose of partitioning the tables out. Because of this, I added check constraints on the date column to each of my sales tables. This way, when I select from the view, it knows which tables to read instead of scanning every table.
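For context, the setup looks roughly like this (simplified; the actual definitions differ):
-- each monthly table carries a CHECK constraint on its date range, e.g.:
ALTER TABLE factdailysales_201503
    ADD CONSTRAINT ck_factdailysales_201503_date
    CHECK ([Date] >= '2015-03-01' AND [Date] < '2015-04-01');

CREATE VIEW Sales_Orig
AS
SELECT [Date], [retail] FROM factdailysales_201501
UNION ALL
SELECT [Date], [retail] FROM factdailysales_201502
UNION ALL
SELECT [Date], [retail] FROM factdailysales_201503;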
Here are a couple of examples:
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= '2015-03-01'
The execution plan for this query only pulls from the partitions that I need.
The problem I'm facing right now is that most of the time, when my team writes stored procedures, they will more than likely write their queries so that a date variable is passed into the WHERE clause.
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
However, when a variable is passed in, the execution plan now scans ALL of the partitions in the view, causing the query to take way longer than when I hard-coded the date.
I suppose I could do dynamic SQL again and insert the date string into the SELECT statement, but that would bring me right back to where I started: trying to get rid of dynamic SQL in the first place for this simple sales query.
So my question is, am I setting this up wrong? Am I on the right track? It seems that the view can't take in a variable for the check constraint and ends up scanning every table. Is there another approach anyone would recommend? Maybe my original solution of just looping through partitions via dynamic SQL is the best way to do it?
** EDIT **
http://sqlsunday.com/2014/08/31/partitioned-views/
This article is actually where I initially saw the idea! It seems when using that exact same solution, I'm still experiencing the same struggle!
Thanks!!
Okay, this might work. It's a table-valued function that only accesses tables according to your @Start and @End parameters, so it only touches the "partitions" it needs. I figured you could take this concept and write some dynamic SQL to create all the IF statements.
Now, of course, new tables are added every day, so how does that tie in? Well, I think the best way would be to alter the function every day, adding the next sales table. That way querying it stays simple, and you could use the same dynamic SQL you used to create the function to alter it, which should be relatively straightforward.
Note: I added default values that are the min and max of the data type DATE. That way you could query, say, everything from 20140101 onward, or vice versa.
Your tables
SELECT CAST('20150101' AS DATE) datesVal INTO factDailySales_20150101;
SELECT CAST('20150102' AS DATE) datesVal INTO factDailySales_20150102;
SELECT CAST('20150103' AS DATE) datesVal INTO factDailySales_20150103;
The Function
CREATE FUNCTION ufn_factTotalSales (@Start DATE = '17530101', @End DATE = '99991231')
RETURNS @factTotalSales TABLE
(
    datesVal DATE
)
AS
BEGIN
    IF (CAST('20150101' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150101
    END

    IF (CAST('20150102' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150102
    END

    IF (CAST('20150103' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150103
    END

    RETURN;
END
GO
All tables
SELECT *
FROM ufn_factTotalSales(default,default)
All tables greater than or equal to 20150102
SELECT *
FROM ufn_factTotalSales('20150102',default)
All tables less than or equal to 20150102
SELECT *
FROM ufn_factTotalSales(default,'20150102')
All tables between specific range
SELECT *
FROM ufn_factTotalSales('20150101','20150102')
Is this the ideal solution? No. The ideal would be to combine all the tables into one and have good indexes. I know you said that wouldn't work because of the way other code has been written. Hear me out. Now, perhaps this is off the wall, but let's say you do combine the tables; obviously there are old scripts looking for the specific daily sales tables. Maybe you could create views with the dailySales names that access factTotalSales. Or you could create synonyms for factTotalSales that correspond to each factDailySales table.
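A sketch of the synonym idea (assuming the combined table is named factTotalSales; the names are illustrative):
-- old code that references the daily tables would transparently hit the combined table
CREATE SYNONYM dbo.factDailySales_20150101 FOR dbo.factTotalSales;
CREATE SYNONYM dbo.factDailySales_20150102 FOR dbo.factTotalSales;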
Maybe you could look into that. It wouldn't be easy, but I think letting SQL Server optimize your queries the way it was designed is a better way of doing it instead of forcing it with dynamic SQL.
Just my two cents. Hope this helps. At the very least, I hope it gave you some ideas.
5 years later: option(recompile).
The planner needs to have access to the constants to eliminate the table entirely from the query plan. With a variable, without a forced recompile, a generic plan is used. (Related: parameter sniffing.)
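For example, a minimal sketch using the query from the question:
DECLARE @SD DATE = '2015-03-01';

SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
OPTION (RECOMPILE); -- lets the optimizer see the runtime value and prune partitions at compile time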
Without the recompile, the query plan is larger, as it has to include all tables, but that does not mean that all tables are actually scanned: look at the IO stats, as table-scan elimination occurs at runtime even if the scans still show up in the query plan.
The 'Number Of Executions' in the query plan will be 0 when the tables are not scanned: unfortunately, these branches are still reported as a non-zero percentage cost "Table Scan" node in the query plan & UI, which will appear high proportionally if the query is trivially fast. The displayed percentage cost of these extra "Table Scan" nodes approaches zero as the amount of data returned from the actually used base tables increases.
This same optimization/elimination occurs when the view is not a Partitioned View (eg. base tables are missing partition column in PK), yet the underlying tables have a suitable Check Constraint on the filtered column. It also occurs when the view selects a constant value to establish the partition that is not otherwise stored in the table. With a constant in the query or recompiled plan the tables will be eliminated entirely. With a variable the tables will still not actually be scanned and thus eliminated logically during query execution.
The use of a proper Partitioned View is only really beneficial to allow a direct Insert & Update, with the major caveat that it requires the partition column to be in each table's PK and disallows the use of an identity column (making a Partitioned View largely useless IMHO). SQL Server handles the optimizations very similarly for the other quasi-Partitioned View cases.
(This is on SQL Server 2014; earlier versions might not have optimized the different patterns as efficiently.)

Is there a logically equivalent and efficient version of this query without using a CTE?

I have a query on a PostgreSQL 9.2 system that takes about 20s in its normal form but only takes ~120ms when using a CTE.
I simplified both queries for brevity.
Here is the normal form (takes about 20s):
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
ORDER BY modified_at DESC
LIMIT 25;
Here is the explain for this query: http://explain.depesz.com/s/2v8
The CTE form (about 120ms):
WITH raw AS (
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
)
SELECT *
FROM raw
ORDER BY modified_at DESC
LIMIT 25;
Here is the explain for the CTE: http://explain.depesz.com/s/uxy
Simply moving the ORDER BY to the outer part of the query reduces the cost by 99%.
I have two questions: 1) is there a way to construct the first query without using a CTE in such a way that it is logically equivalent but more performant, and 2) what does this difference in performance say about how the planner determines how to fetch the data?
Regarding the questions above, are there additional statistics or other planner hints that would help improve the performance of the first query?
Edit: Taking away the limit also causes the query to use a heap scan as opposed to an index scan backwards. Without the LIMIT the query completes in 40ms.
After seeing the effect of the LIMIT I tried with LIMIT 1, LIMIT 2, etc. The query performs in under 100ms when using LIMIT 1 and 10s+ with LIMIT > 1.
After thinking about this some more, question 2 boils down to why does the planner use an index scan backwards in one case and a bitmap heap scan + sort in another logically equivalent case? And how can I "help" the planner use the efficient plan in both cases?
Update:
I accepted Craig's answer because it was the most comprehensive and helpful. The way I ended up solving the problem was by using a query that was practically equivalent, though not logically equivalent. At the root of the issue was a backwards index scan of the index on modified_at. In order to inform the planner that this was not a good idea, I added a predicate of the form WHERE modified_at >= NOW() - INTERVAL '1 year'. This included enough data for the application but prevented the planner from going down the backwards index scan path.
This was a much lower impact solution that prevented the need to rewrite the queries using either a sub query or a CTE. YMMV.
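In sketch form, applied to the simplified query above, the workaround looked like this:
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
      atype = 35 AND
      aid IN (1, 2, 3) AND
      modified_at >= NOW() - INTERVAL '1 year'  -- steers the planner away from the backwards index scan
ORDER BY modified_at DESC
LIMIT 25;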
Here's why this is happening, with the following explanation current until at least 9.3 (if you're reading this and on a newer version, check to make sure it hasn't changed):
PostgreSQL doesn't optimize across CTE boundaries. Each CTE clause is run in isolation and its results are consumed by other parts of the query. So a query like:
WITH blah AS (
SELECT * FROM some_table
)
SELECT *
FROM blah
WHERE id = 4;
will cause the full inner query to get executed. PostgreSQL won't "push down" the id = 4 qualification into the inner query. CTEs are "optimization fences" in that regard, which can be either good or bad; they let you override the planner when you want to, but prevent you from using CTEs as simple syntactic cleanup for a deeply nested FROM subquery chain if you do need push-down.
If you rephrase the above as:
SELECT *
FROM (SELECT * FROM some_table) AS blah
WHERE id = 4;
using a sub-query in FROM instead of a CTE, Pg will push the qual down into the subquery and it'll all run nice and quickly.
As you have discovered, this can also work to your benefit when the query planner makes a poor decision. It appears that in your case a backward index scan of the table is immensely more expensive than a bitmap or index scan of two smaller indexes followed by a filter and sort, but the planner doesn't think it will be, so it plans the query to scan the index.
When you use the CTE, it can't push the ORDER BY into the inner query, so you're overriding its plan and forcing it to use what it thinks is an inferior execution plan - but one that turns out to be much better.
There's a nasty workaround that can be used for these situations called the OFFSET 0 hack, but you should only use it if you can't figure out a way to make the planner do the right thing - and if you have to use it, please boil this down to a self-contained test case and report it to the PostgreSQL mailing list as a possible query planner bug.
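For reference, the hack applied to the simplified query would look like this (a sketch only, and only as a last resort):
SELECT *
FROM (
    SELECT *
    FROM tableA
    WHERE (columna = 1 OR columnb = 2) AND
          atype = 35 AND
          aid IN (1, 2, 3)
    OFFSET 0  -- prevents the subquery from being flattened into the outer query
) AS sub
ORDER BY modified_at DESC
LIMIT 25;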
Instead, I recommend first looking at why the planner is making the wrong decision.
The first candidate is stats / estimates problems, and sure enough when we look at your problematic query plan there's a factor of 3500 mis-estimation of the expected result rows. That's big, but not impossibly big, though it's more interesting that you actually only get one row where the planner is expecting a non-trivial row set. That doesn't help us much, though; if the row count is lower than expected that means that choosing to use the index was a better choice than expected.
The main issue looks like it's not using the smaller, more selective indexes sierra_kilo and papa_lima because it sees the ORDER BY and thinks that it'll save more time doing a backward index scan and avoiding the sort than it really does. That makes sense given that there's only one matching row to sort! If it got the expected 3500 rows then it might've made more sense to avoid the sort, though that's still a fairly small rowset to just sort in memory.
Do you set any parameters like enable_seqscan, etc.? If you do, unset them; they're for testing only and totally inappropriate for production use. If you aren't using the enable_ params, I think it's worth raising this on the PostgreSQL mailing list pgsql-performance. The anonymized plans make this a bit difficult, though, especially since there's no guarantee that identifiers from one plan refer to the same objects in the other plan, and they don't match what you wrote in the query in the question. You'll want to produce a properly hand-anonymized version where everything matches up before asking on the mailing list.
There's a fairly good chance that you'll need to provide the real values for anyone to help. If you don't want to do that on a public mailing list, there are commercial support options available. (I should note that I work for one of those companies, per my profile.)
Just a shot in the dark, but what happens if you run this
SELECT *
FROM (
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
) AS x
ORDER BY modified_at DESC
LIMIT 25;

Is using Table variables faster than temp tables

Am I safe to assume that where I have stored procedures using the tempdb to write a temporary table, I'd be better off switching these to table variables to get better performance?
Temp tables are better in performance. If you use a table variable and the data in the variable gets too big, SQL Server converts the variable automatically into a temp table.
It depends, like almost every Database related question, on what you try to do. So it is hard to answer without more information.
So my answer is, try it and have a look at the execution plan. Use the fastest way with the lowest costs.
MSDN - Displaying Graphical Execution Plans (SQL Server Management Studio)
@Table variables can be faster as there is less "setup time", since the object is in memory only.
@Table variables have a lot of catches though.
You can have a primary key on a @Table variable, but that's about it. Other indexes (clustered or non-clustered, for combinations of columns) are not possible.
Also, if your table is going to contain any real data volume (more than about 200, maybe 1000 rows), then accessing the table will be slower, especially when you will probably not have a useful index on it.
#Tables are a pain in procs as they need to be dropped when debugging; they take longer to create, and they take longer to set up as you need to add indexes as a second step. But if you have lots of data, then it's #Tables every time.
Even in cases where you have fewer than 100 rows of data in a table, you may still want to use #Tables, as you can create a useful index on the table.
In summary, I use @Table variables most of the time for the ease of use when doing simple procs etc., but anything that needs to perform should be a #Table.
@Table variables have no statistics, so the execution plan entails more guesswork; hence the recommended upper limit of 1000-ish rows. #Tables have statistics, but these can be cached between invocations. If your cardinalities differ significantly each time the SP runs, you'd want to REBUILD and RECOMPILE each time. This is an overhead, of course, but one which must be balanced against the cost of a rubbish plan.
Both types will do IO to TempDB.
So no, @Table variables are not a panacea.
Table variables can perform very poorly as the number of rows in them increases.
Why is this?
Table variables don’t have distribution statistics and don’t trigger recompiles. Because of this, SQL Server is not able to estimate the number of rows in a table variable like it does for normal tables. When the optimiser compiles code that contains a table variable, it assumes a table is empty and uses an expected row count of 1 for the cardinality estimate. Because the optimiser only thinks a table variable contains a single row, it picks operators for the execution plan that work well with a small set of records, like the NESTED LOOPS operator for a JOIN operation.
As an example, I have just fixed a stored procedure which was performing poorly. The code was populating a table variable and using it in a join to filter the number of rows to accounts which were relevant:
FROM dbo.DimInvestorAccount
INNER JOIN @accounts acclist
ON acclist.AccountNumber = DimInvestorAccount.investorAccountNumber
+ 9 additional tables joined...
When run for list of 1700 accounts, the query was taking 1m17s. Just changing the filter table definition from:
DECLARE @accounts TABLE (AccountNumber VARCHAR(20) COLLATE Latin1_General_BIN INDEX idx NONCLUSTERED)
to
CREATE TABLE #accounts (AccountNumber VARCHAR(20) COLLATE Latin1_General_BIN INDEX idx NONCLUSTERED)
brought the query time down to 800ms. Note that with 5 rows in the table, there was no significant difference - both temp table and table variable run in +/-400ms.
Microsoft's recommendation is to use Table Variables if the number of rows is <100.
Note that Microsoft have made changes in SQL Server 2019 to improve this (v15.x/Compatibility level 150)

select * or individual columns

Is there any difference in performance between
select *
from table name
and
select [col1]
,[col2]
......
,[coln]
from table name
It is a SQL antipattern to use SELECT *. It is faster for the database when you specify columns (it doesn't have to look them up), and more importantly, you should never specify more columns than you actually need. If you have a join in your query, you have at least one column you don't need (the join column), and thus SELECT * is always slower to return records in a query with a join, since it returns more information than necessary.
Now all this sounds like a small improvement and in a small system it might be, but as the database grows and gets busy, the performance implications become bigger. There is no excuse for using SELECT *.
SELECT * is also bad for maintenance, especially if you use it to insert records - always specify both the columns you are inserting into and the fields from the SELECT in an INSERT statement that uses a SELECT. It will break if you change the table structure. You may also end up showing the user columns you don't want them to see, such as the GUID added for replication.
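For instance, a sketch of the two INSERT forms (the table and column names are illustrative):
-- fragile: breaks, or silently puts data in the wrong columns, if either table changes
INSERT INTO ArchiveOrders
SELECT * FROM Orders;

-- robust: both column lists are explicit
INSERT INTO ArchiveOrders (OrderID, CustomerID, OrderDate)
SELECT OrderID, CustomerID, OrderDate
FROM Orders;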
If you use SELECT * in a view (at least in SQL Server) the view will still not automatically update in the case of a change to the underlying tables. If someone gets the silly idea to re-arrange the column order in the table (yeah I know you shouldn't but people do this on occasion) using SELECT * might mean that data will show up in the wrong columns in reports or inserts which can cause problems where things might be misinterpreted. I can think of one case where two columns were swapped in a staging table and the social security number became the amount we intended to pay the person giving a speech. You can see how that might really muck up the accounting except we didn't use SELECT * so we were safe because the columns retained the same name.
I want to note that you don't even save much development time by using SELECT *. It takes me a maximum of about 15 seconds more to use the column names, even in a large table, as I drag them over from the object browser in SQL Server (you can get all the columns in one step).
I suppose SELECT * takes a few CPU cycles less, since SQL Server parses your statements, but I don't think it will make a noticeable difference.