T-SQL Indices when using OR statement

I've looked through several sources to get information on indexes regarding AND statements, table joins, etc., but I've yet to find much useful information for cases where OR statements are present. That being said, how would someone ideally handle creating indexes for a situation like the one here?
Updated SQL Statement to not use like.
SELECT *
FROM table_name
WHERE table_name.field1 = 'criteria'
  AND (table_name.field2 = 1 OR
       table_name.field3 = 0)
Obviously, I would want to create an index for field1, but whether I should include fields 2 and 3 or handle them in another way, I'm not sure. If this were a simple three-part AND statement, I would probably use CREATE INDEX IND_index_name ON table_name (field1, field2, field3), but I have reason to believe this logic doesn't work the same way for OR statements. Based on the statement given, I can assume that field1 always needs to be evaluated first, but there are then multiple ways I could handle the rest. The potential solutions I am evaluating are listed below, but I'm not sure whether there may yet be another, better solution. Any help would be greatly appreciated!
Option 1, a single index with included columns:
CREATE INDEX IND_index_name ON table_name (field1) INCLUDE (field2, field3)
Option 2, a single composite index on all three fields:
CREATE INDEX IND_index_name ON table_name (field1, field2, field3)
Option 3, two composite indexes:
CREATE INDEX IND_index_name1 ON table_name (field1, field2)
CREATE INDEX IND_index_name2 ON table_name (field1, field3)
Option 4, three single-column indexes:
CREATE INDEX IND_index_name1 ON table_name (field1)
CREATE INDEX IND_index_name2 ON table_name (field2)
CREATE INDEX IND_index_name3 ON table_name (field3)
As additional info, I do not have access to the SQL Server Management Studio tools because I am using DBeaver. For the sake of this example, let's assume the query takes almost a full workday to run without indexes. The answers submitted here will be used to tackle a much larger, more complex query where data from table_name would be used in several subsequent queries that run after the query shown above.

There is not one single way to index an OR condition.
I would try three separate indexes.
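As a complementary sketch (not part of the answer above, and using only the question's hypothetical table and column names): a common way to let each predicate use its own index is to rewrite the OR as a UNION, for example:
-- Sketch only: each branch can seek a narrower index, e.g. the question's
-- (field1, field2) and (field1, field3) composites or the three single-column indexes.
SELECT *
FROM table_name
WHERE field1 = 'criteria'
  AND field2 = 1
UNION
SELECT *
FROM table_name
WHERE field1 = 'criteria'
  AND field3 = 0;
UNION removes rows that satisfy both branches; if table_name can contain exact duplicate rows the result may differ from the original OR query, so treat this as something to compare against the indexed OR version rather than a drop-in replacement.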

Related

Most efficient way to DECODE multiple columns -- DB2

I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables, most of which store a significant number of their columns as numeric codes; these numbers correspond to a lookup table holding the real values. We are talking about 9,500 different values (e.g. '502 = yes' or '1413 = Graduate Student').
Normally I would just add a WHERE clause and match on equality, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine, but I have to manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this, and some need DECODE statements that would have 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be clearer, I know I can do a ton of INNER JOINs, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. I'm not sure whether that is possible in your case, so I'll leave that aside.
As mentioned in my comment, I would stay away from using DECODE as described in your post. I would start by writing it as ordinary joins:
SELECT a.TEST_STATUS
     , b.TEST_STATUS_DESCRIPTION
     , a.ANOTHER_STATUS
     , c.ANOTHER_STATUS_DESCRIPTION
     , ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
     ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
     ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow, there are a couple of things you can try:
Create a statistical view that can help the optimizer determine cardinalities for the joins (which may help it create a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
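As a rough sketch of what that could look like here (hypothetical names; RUNSTATS is a command run from the CLP or ADMIN_CMD rather than plain SQL):
-- A view over the join, marked as a statistical view, then statistics collected on it.
CREATE VIEW TEST_STATUS_SV AS
    SELECT a.TEST_STATUS
    FROM TEST_TABLE a
    JOIN TEST_STATUS_TABLE b
        ON a.TEST_STATUS = b.TEST_STATUS;
ALTER VIEW TEST_STATUS_SV ENABLE QUERY OPTIMIZATION;
RUNSTATS ON TABLE yourschema.TEST_STATUS_SV WITH DISTRIBUTION;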
If your license permits it, you can experiment with Materialized Query Tables (MQTs). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
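A minimal sketch of that approach, again with hypothetical names (a deferred-refresh MQT that pre-joins the descriptions):
-- The MQT materializes the decoded columns; refresh it when the base tables change.
CREATE TABLE TEST_DECODED_MQT AS (
    SELECT a.TEST_ID, a.TEST_STATUS, b.TEST_STATUS_DESCRIPTION
    FROM TEST_TABLE a
    JOIN TEST_STATUS_TABLE b
        ON a.TEST_STATUS = b.TEST_STATUS
) DATA INITIALLY DEFERRED REFRESH DEFERRED;
REFRESH TABLE TEST_DECODED_MQT;
-- The optimizer will typically only route queries to a REFRESH DEFERRED MQT when
-- the session sets:
SET CURRENT REFRESH AGE ANY;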
A third option, if your lookup table is fairly static, is to cache the lookup table in the application: read TEST_TABLE from the database and look up the descriptions in the application. A further improvement may be to add triggers that invalidate the cache when the lookup table is modified.
If you don't want to write all these joins, you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id, or NULL if it is not found.
With your mentioned 10k id/text pairs and an index on the ID column, this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
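Usage would then look something like this (a sketch that assumes all coded columns resolve against the same TEST.LOOKUP table, as described in the question; ANOTHER_STATUS is only illustrative):
-- One call to the lookup function per coded column.
SELECT TEST_ID,
       lookup(TEST_STATUS)    AS TEST_STATUS,
       lookup(ANOTHER_STATUS) AS ANOTHER_STATUS
FROM TEST_TABLE;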

Postgres "first" aggregation function

I am aggregating a table using its file ID field. Each file has a name which matches exactly one file id.
select file_key, min(fullfilepath)
from table
group by file_key
Because I know the structure of the table, I know that I need ANY fullfilepath. The min and the max are OK, but they take a lot of time.
I came across this aggregate function, which returns the first value. Unfortunately, this function takes a long time because it scans the whole table. For example, this is very slow:
select first(file_id) from table;
What is the fastest way to do that, with or without an aggregate function?
There is no way to make your first query with the GROUP BY clause faster, because it has to scan the whole table to find all groups.
Your second query can be made faster:
SELECT (
    SELECT file_id FROM "table"
    WHERE file_id IS NOT NULL
    LIMIT 1
);
There is no way to optimize the query as you wrote it, because the aggregate function is a black box to PostgreSQL.
I doubt that this will help performance, but it may be useful if anyone actually wants a first aggregate.
-- coalesce isn't a regular function, so make an equivalent function to use as the aggregate's transition function.
create function coalesce_("anyelement","anyelement") returns "anyelement"
language sql as $$ select coalesce( $1,$2 ) $$;
create aggregate first("anyelement") (sfunc=coalesce_, stype="anyelement");
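Applied to the question's query, usage would be something like the following sketch (quoting the placeholder name "table", since TABLE is a reserved word); it still reads the whole table, it just returns an arbitrary non-null value per group instead of computing min:
-- "first" is the custom aggregate defined above, not a built-in.
SELECT file_key, first(fullfilepath)
FROM "table"
GROUP BY file_key;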
select distinct on (file_key)
       file_key, fullfilepath
from table
order by file_key
That will return one record for each file_key
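Not stated in the answer above, but as a hedged aside: if this query is run often, a matching b-tree index can let the planner avoid sorting the whole table for the DISTINCT ON (it still reads every row), e.g.:
-- Hypothetical index; "table" is the question's placeholder name, quoted because TABLE is a reserved word.
CREATE INDEX table_file_key_fullfilepath_idx ON "table" (file_key, fullfilepath);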

ERROR: syntax error at or near "group"

Hello, I'm writing an SQL query but I am getting a syntax error on the line with the GROUP BY. What could the problem be? Please help if you can.
UPDATE intersection_points i
SET nbr_victimes = sum(tue+bl+bg)
FROM accident_ma a ,intersection_points i
WHERE (ST_DWithin(i.st_intersection,a.geom_acc, 10000) group by st_intersection)) ;
GROUP BY is its own clause, it's not part of a WHERE clause.
This is what you have:
WHERE (
ST_DWithin(i.st_intersection,a.geom_acc, 10000)
group by st_intersection
)
This is what you need:
WHERE ST_DWithin(i.st_intersection,a.geom_acc, 10000)
group by st_intersection
Edit: In response to comments, it sounds like your JOIN is a bit more complex than the UPDATE ... FROM syntax is meant for. Take a look at the "Notes" section on this page:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
Because of this indeterminacy, referencing other tables only within sub-selects is safer, though often harder to read and slower than using a join.
Normally this would involve changing the syntax to something like:
UPDATE SomeTable
SET SomeColumn = 'Some Value'
WHERE AnotherColumn =
    (SELECT AnotherColumn
     FROM AnotherTable
     -- etc.)
However, the use of ST_DWithin() in this query may complicate that quite a bit. Without much deeper knowledge of the table structures, relationships, and overall intent of this update there probably isn't much more help I can give. Essentially you're going to need to clarify for the database exactly what records need to be updated and how to update them, which may involve changing your query to this latter sub-select syntax in some way.
I don't understand your data structure. I created the following tables from your query; please check the table structure.
If the table structure is like this, your query should be:
UPDATE intersection_points
SET nbr_victimes = (
    SELECT SUM(a.tue + a.bl + a.bg)
    FROM accident_ma a
    WHERE st_dwithin(st_intersection, a.geom_acc, 1000)
);

IN clause with large list in OpenJPA causing too complex statement

I have to create a named query where I need to group my results by some fields and also use an IN clause to limit my results.
It looks something like this:
SELECT new MyDTO(e.objID) FROM Entity e WHERE e.objId IN (:listOfIDs) GROUP BY e.attr1, e.attr2
I'm using OpenJPA and IBM DB2. In some cases my list of IDs can be very large (>80,000 IDs), and then the generated SQL statement becomes too complex for DB2, because the final generated statement prints out all the IDs, like this:
SELECT new MyDTO(e.objID) FROM Entity e WHERE e.objId IN (1,2,3,4,5,6,7,...) GROUP BY e.attr1, e.attr2
Is there any good way to handle this kind of query? A possible workaround would be to write the IDs into a temporary table and then use the IN clause against that table.
You should put all of the values in a table and rewrite the query as a join. This will not only solve your query problem, it should be more efficient as well.
declare global temporary table ids (
objId int
) with replace on commit preserve rows;
-- If this statement is too long, use a couple of insert statements.
insert into session.ids values
    (1), (2), (3), (4), ...;
select new mydto(e.objID)
from entity e
join session.ids i on
e.objId = i.objId
group by e.attr1, e.attr2;

Optimization of count query for PostgreSQL

I have a table in PostgreSQL that contains an array column which is updated constantly.
In my application I need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <@ table.array_column)
But when increasing the number of rows and the number of executions of that query (several times per second, possibly hundreds or thousands), the performance decreases a lot. It seems to me that counting in PostgreSQL might have a linear order of execution (I'm not completely sure of this).
Basically my question is:
Is there an existing pattern I'm not aware of that applies to this situation? What would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, it doesn't seem to be usable for NOT ARRAY[...] <@ indexed_col, and GIN indexes are unsuitable for frequently-updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable (id, array_column) VALUES (1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <@ arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
    SELECT count(id) FROM arrtable
) - (
    SELECT count(id) FROM arrtable
    WHERE (ARRAY[1] <@ arrtable.array_column)
);
It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original requires, it also needs a GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice as long as your original query, and at best about half as long, on the same data set. It's worst for cases where the index is not very selective, like ARRAY[1]: 4s vs 2s for the original query. Where the index is highly selective (i.e. not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view with a trigger as @debenhur suggests, or try to invert the array to be a list of parameters that the entry does not have, so you can use a GiST index as @maniek suggests.
Is there an existing pattern I’m not aware of that applies to this
situation? what would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.
Again, a seq scan on the array-containing table does the job just as well.
I think with your current data model you are out of luck. Try to think of an algorithm that the database would have to execute for your query: there is no way it could work without sequentially scanning the data.
Can you arrange the column so that it stores the inverse of the data (so that the query would be select count(id) from table where ARRAY['parameter value'] <@ table.array_column)? This query could use a GIN/GiST index.
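A sketch of what that inversion could look like, assuming a hypothetical yourtable with a hypothetical missing_parameters column that lists the parameter values each row does not have:
-- Hypothetical inverted column and index; the GIN update-cost caveat from the
-- other answer still applies if the table is updated constantly.
ALTER TABLE yourtable ADD COLUMN missing_parameters text[];
CREATE INDEX yourtable_missing_gin_idx
    ON yourtable USING GIN (missing_parameters);

-- "Rows that do NOT contain a parameter" becomes "rows whose missing list DOES
-- contain it", which is an indexable containment search.
SELECT count(id)
FROM yourtable
WHERE ARRAY['parameter value'] <@ missing_parameters;
Maintaining that inverted column of course requires knowing the full set of possible parameter values and updating the list whenever they change.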