distinct performance on large table - sql-server-2008-r2

Suppose I have a table with a column that has repeats, e.g.
Column1
---------
a
a
a
a
b
a
c
d
e
... so on
Maybe it has hundreds of thousands of rows. Then, say, I need to pull the distinct values from this column. I can do so easily with SELECT DISTINCT, but I'm wondering about the performance.
I could also give each item in Column1 an id and then create a new table referenced by Column1 (to normalize this more appropriately). Though this adds extra complexity to inserts, and adds joins for other possible queries.
Is there some way to index just the distinct values in a column, or is the normalization thing the only way to go?

An index on column1 will considerably speed up processing of DISTINCT, but if you are willing to trade some space and some (short) time during insert/update/delete, you can resort to a materialized view. In SQL Server this is an indexed view, which you can think of as a dynamic table that is produced and maintained automatically according to the view definition.
create view dbo.view1
with schemabinding
as
select column1,
       count_big(*) as cnt
from dbo.theTable -- schemabinding requires two-part (schema.object) names
group by column1
-- create unique clustered index ix_view1 on dbo.view1 (column1)
(Do not forget to execute the commented CREATE INDEX command. I usually do it this way so that the view definition contains the index definition, reminding me to reapply it if I ever need to change the view.)
When you want to use it, be sure to add the NOEXPAND hint to force use of the materialized data (this part will remain a mystery to me: something created as a performance enhancement is not used by default but has to be activated on the spot; outside Enterprise Edition, the optimizer will not match the indexed view unless you ask for it).
select *
from view1 with (noexpand)
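For comparison, the plain index mentioned in the first sentence is just a nonclustered index on the column (the index name here is illustrative):
create nonclustered index ix_theTable_column1
on dbo.theTable (column1)
With that in place, SELECT DISTINCT column1 becomes a scan of the narrow index instead of the whole table; the indexed view goes one step further by storing only the distinct values.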

Related

how do I organize a table in PostgreSQL in ascending order?

I would like to organize my PostgreSQL table in ascending order by the date it was created on.
So I tried:
SELECT *
FROM price
ORDER BY created_on;
And it did show me the data in that order; however, it did not save it.
Is there a way I can make it so it gets saved?
Tables in a relational database represent unordered sets. There is no such thing as the "order of rows" in a table.
If you need a specific sort order, the only way is to use an order by in a select statement as you did.
If you don't want to type the order by each time, you can create a view that does that:
create view sorted_price
as
select *
from price
order by created_on;
But be warned: if you sort the rows from the view in a different way, e.g. select * from sorted_price order by created_on desc, Postgres will actually apply two sorts. The query optimizer is unfortunately not smart enough to remove the one stored in the view's definition.
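If you regularly need a different order, it is cheaper to skip the view and put the ORDER BY on the base table directly:
select *
from price
order by created_on desc;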

Most efficient way to DECODE multiple columns -- DB2

I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables, most of which store a significant number of their columns as numbers; these numbers correspond to a lookup table with the real values. We are talking 9,500 different values (e.g. '502 = Yes' or '1413 = Graduate Student').
Normally, I would just add a WHERE clause and show where they are equal, but since there are 20-30 columns per table that need to be decoded, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine... but I manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this, AND some need DECODE statements with 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. I'm not sure whether that is possible for you, so I'll leave that aside.
As mentioned in my comment, I would stay away from using DECODE as described in your post. I would start by doing it with the usual joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help the optimizer determine cardinalities for the joins (which may lead to a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
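A minimal sketch of that approach, reusing the hypothetical TEST_TABLE/TEST_STATUS_TABLE names from above (note that RUNSTATS is a CLP command rather than plain SQL):
CREATE VIEW SV_TEST_STATUS AS
SELECT a.TEST_STATUS
FROM TEST_TABLE a
JOIN TEST_STATUS_TABLE b ON a.TEST_STATUS = b.TEST_STATUS;

ALTER VIEW SV_TEST_STATUS ENABLE QUERY OPTIMIZATION;

-- from the CLP: collect statistics on the statistical view
RUNSTATS ON TABLE myschema.SV_TEST_STATUS WITH DISTRIBUTION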
If your license permits, you can experiment with Materialized Query Tables (MQTs). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
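A minimal MQT sketch over the same hypothetical tables; REFRESH DEFERRED keeps the maintenance cost out of the OLTP path at the price of manual refreshes:
CREATE TABLE TEST_DECODED_MQT AS (
SELECT a.TEST_ID, b.TEST_STATUS_DESCRIPTION
FROM TEST_TABLE a
JOIN TEST_STATUS_TABLE b ON a.TEST_STATUS = b.TEST_STATUS
) DATA INITIALLY DEFERRED REFRESH DEFERRED;

REFRESH TABLE TEST_DECODED_MQT;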
A third option, if your lookup table is fairly static, is to cache the lookup table in the application: read TEST_TABLE from the database and look up the descriptions in the application. A further improvement may be to add triggers that invalidate the cache when the lookup table is modified, as sketched below.
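One way to support that invalidation on the database side is a version counter the application can poll (the LOOKUP_TABLE and LOOKUP_VERSION names here are made up):
CREATE TABLE LOOKUP_VERSION (VERSION BIGINT NOT NULL);
INSERT INTO LOOKUP_VERSION VALUES (0);

-- bump the version whenever rows are deleted;
-- analogous triggers cover INSERT and UPDATE
CREATE TRIGGER LOOKUP_CHANGED_DEL
AFTER DELETE ON LOOKUP_TABLE
FOR EACH STATEMENT
UPDATE LOOKUP_VERSION SET VERSION = VERSION + 1;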
If you don't want to do all these joins, you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
Containing some id/text pairs, this will return the text value corresponding to an id, or NULL if it is not found.
With your mentioned 10k id/text pairs and an index on the ID field, this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
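Usage against the earlier example then looks like:
SELECT TEST_ID,
       LOOKUP(TEST_STATUS) AS TEST_STATUS
FROM TEST_TABLE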

TSQL: left join on view very slow

In a complex stored procedure, we use a view and LEFT JOIN on this view. It takes 40 seconds to execute.
Now, if we create a TABLE variable, store the result of the view in it, and then do the LEFT JOIN on this variable instead of the view, it takes 3 seconds...
What can explain this behavior?
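For reference, the pattern described above looks roughly like this (view, table, and column names are invented):
DECLARE @materialized TABLE (id INT PRIMARY KEY, name VARCHAR(50));

INSERT INTO @materialized (id, name)
SELECT id, name
FROM dbo.TheView;

SELECT t.*, m.name
FROM dbo.MainTable t
LEFT JOIN @materialized m ON m.id = t.id;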
The view expands into the main query. So if you have 5 tables in the view, these expand with the extra table into one big query plan with 6 tables. The performance difference will most likely be caused by the added complexity and permutations of the extra table you LEFT JOIN with.
Another potential issue: do you LEFT JOIN on a column that has some processing on it? That will further kill performance.
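A hypothetical illustration of that second issue (table and column names are invented): any expression wrapped around the join column must be evaluated per row and defeats index seeks.
-- slow: processing on the join column
SELECT t.id
FROM dbo.MainTable t
LEFT JOIN dbo.TheView v ON LTRIM(RTRIM(t.code)) = v.code;

-- better: join on the raw column and clean the data at load time instead
SELECT t.id
FROM dbo.MainTable t
LEFT JOIN dbo.TheView v ON t.code = v.code;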
Courtesy of Alden W.: this is his response to one of my questions that was similar to yours. (Note that it describes MySQL's view algorithms rather than SQL Server's, but the underlying idea of merging a view into the outer query is the same.)
You may be able to get better performance by specifying the VIEW ALGORITHM as MERGE. With MERGE MySQL will combine the view with your outside SELECT's WHERE statement, and then come up with an optimized execution plan.
To do this, however, you would have to remove the GROUP BY statement from your VIEW. As it is, if a GROUP BY statement is included in your view, MySQL will choose the TEMPTABLE algorithm: a temporary table of the entire view is created first, before being filtered by your WHERE statement.
If the MERGE algorithm cannot be used, a temporary table must be used instead. MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Refers only to literal values (in this case, there is no underlying table)
Here is the link with more info. http://dev.mysql.com/doc/refman/5.0/en/view-algorithms.html
If you can change your view to not include the GROUP BY statement, to specify the view's algorithm the syntax is:
CREATE ALGORITHM = MERGE VIEW...
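For example (view and table names are invented), the outer WHERE is then merged into the view's own query:
CREATE ALGORITHM = MERGE VIEW v_recent_orders AS
SELECT order_id, customer_id, created_on
FROM orders
WHERE created_on >= '2020-01-01';

-- MySQL rewrites this into a single SELECT against orders with both predicates
SELECT * FROM v_recent_orders WHERE customer_id = 42;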

select * or individual columns

Is there any difference in performance between
select *
from table name
and
select [col1]
,[col2]
......
,[coln]
from table name
It is a SQL antipattern to use SELECT *. It is faster for the database when you specify columns (it doesn't have to look them up), and more importantly, you should never specify more columns than you actually need. If you have a join in your query, you have at least one column you don't need (the join column), and thus SELECT * is always slower to return records in a query with a join, since it is returning more information than necessary.
Now all this sounds like a small improvement and in a small system it might be, but as the database grows and gets busy, the performance implications become bigger. There is no excuse for using SELECT *.
SELECT * is also bad for maintenance, especially if you use it to insert records: always specify both the columns you are inserting into and the fields from the SELECT in an INSERT statement that uses a SELECT, or it will break when you change the table structure. You may also end up showing the user columns you don't want them to see, such as the GUID added for replication.
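That is, with hypothetical table names, always write the insert as:
INSERT INTO target_table (col1, col2)
SELECT col1, col2
FROM source_table;
-- not: INSERT INTO target_table SELECT * FROM source_table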
If you use SELECT * in a view (at least in SQL Server), the view will not automatically update when the underlying tables change. If someone gets the silly idea to rearrange the column order in the table (yes, I know you shouldn't, but people do this on occasion), using SELECT * might mean that data shows up in the wrong columns in reports or inserts, which can cause problems where things are misinterpreted. I can think of one case where two columns were swapped in a staging table and the social security number became the amount we intended to pay the person giving a speech. You can see how that might really muck up the accounting, except we didn't use SELECT *, so we were safe because the columns retained the same names.
I want to note that you don't even save much development time by using SELECT *. It takes me at most about 15 seconds more to list the column names, even for a large table, as I drag them over from the object browser in SQL Server (you can get all the columns in one step).
I suppose SELECT * takes a few CPU cycles less to parse, since SQL Server parses your statements, but I don't think it will make a noticeable difference.

SQLite - a smart way to remove and add new objects

I have a table in my database, and I want each row in the table to have a unique id and the rows to be numbered sequentially.
For example: I have 10 rows, each with an id, starting at 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears. And afterwards I add more data, but the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every row number, so that I can access my table arbitrarily.
Is there a way in SQLite to do this? Or do I have to manage the removing and adding of data manually?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
UPDATE table_name SET id=id-1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
If you then delete the row where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you rethink your design. In my opinion, you're asking for trouble in the future (e.g. if you create another table and want to have some relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
SQLite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
If you want to reclaim deleted row ids, the VACUUM command or pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum
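A minimal sketch of those commands (note that auto_vacuum must be set before any tables are created, or the setting must be followed by a VACUUM to take effect):
-- one-off: rebuild the database file, reclaiming free pages
VACUUM;

-- or keep the file compact automatically
PRAGMA auto_vacuum = FULL;
VACUUM;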