I'm joining table A against tables B and C. All three tables have similar columns and indexes, and B and C have about the same number of rows. But A and B have indexes on nvarchar columns while C has indexes on varchar columns.
Tested separately, joining on B is 30-60 times faster than joining on C. (4 seconds vs. 2-4 minutes.) Looking at the execution plan, B uses an index seek while C uses an index scan. The details for the join on C mention implicit conversion of the varchar columns, while the join on B mentions no such conversion. Is this why it's using a scan instead of a seek, and is this probably why it's so slow? (Another potential issue: the index scan on C has an estimated number of executions of 1, but the actual number of executions is around 8500.)
C is static historical data, so I could alter the columns and rebuild the indexes if it would help.
Yes, implicit conversions may result in index scans instead of seeks. Data is converted to the data type with the higher data type precedence, so, as you've seen in this case, the VARCHAR column of table C is converted to an NVARCHAR value. The implicit conversion protects against losing data during the conversion: NVARCHAR columns can hold significantly more distinct characters than VARCHAR, so converting the VARCHAR values from table C ensures that all of them are preserved.
Details on specific implicit conversion scenarios are outlined here. If you have the option and it won't have any negative implications elsewhere, I'd suggest making this column in table C an NVARCHAR data type, along the lines of the sketch below.
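Since C is static, the change could look something like this; the table, column, and index names are placeholders, and the nvarchar length should match whatever A and B use:
-- Drop the index on the join column, change the column's type, then
-- recreate the index so the join compares nvarchar to nvarchar.
-- (Keep NOT NULL only if the column is currently NOT NULL.)
DROP INDEX IX_C_JoinCol ON dbo.C;

ALTER TABLE dbo.C
    ALTER COLUMN JoinCol nvarchar(50) NOT NULL;

CREATE INDEX IX_C_JoinCol ON dbo.C (JoinCol);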
I want to get the max length (in bytes) of a variable-length column. My columns have the following definitions:
shortname character varying(35) NOT NULL
userid integer NOT NULL
columncount smallint NOT NULL
I tried to retrieve some info from the pg_attribute table, but the attlen column has a value of -1 for all variable-length columns. I also tried to use the pg_column_size() function, but it doesn't accept the name of a column as an input parameter.
It can be easily done in SQL Server.
Are there any other ways to get the value I'm looking for?
You will need a CASE expression that checks pg_attribute.attlen and then calculates the maximum size in bytes based on that. To get the max size for a varchar column you can "steal" the expression used for information_schema.columns.character_octet_length for varchar or char columns.
Something along these lines:
select a.attname,
       t.typname,
       case
         -- fixed-length types report their size directly in attlen
         when a.attlen <> -1 then a.attlen
         -- text and bytea can hold up to 1GB
         when t.typname in ('bytea', 'text') then pg_size_bytes('1GB')
         -- varchar/char: derive the octet length from the type modifier
         -- (char(n) is stored in pg_type under the name bpchar)
         when t.typname in ('varchar', 'bpchar') then
           information_schema._pg_char_octet_length(
             information_schema._pg_truetypid(a.*, t.*),
             information_schema._pg_truetypmod(a.*, t.*))
       end as max_bytes
from pg_attribute a
  join pg_type t on a.atttypid = t.oid
where a.attrelid = 'stuff.test'::regclass
  and a.attnum > 0
  and not a.attisdropped;
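For the columns shown in the question, and assuming a UTF-8 encoded database (where a single character can take up to 4 bytes), the result would look roughly like this:
   attname   | typname | max_bytes
-------------+---------+-----------
 shortname   | varchar |       140
 userid      | int4    |         4
 columncount | int2    |         2
(3 rows)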
Note that this won't return a proper size for numeric, as that is also a variable-length type. The documentation says "The actual storage requirement is two bytes for each group of four decimal digits, plus three to eight bytes overhead".
As a side note: this seems an extremely strange thing to do, especially with your mention of temp tables in stored procedures. More often than not, temp tables are not needed in Postgres. Instead of blindly copying the old approach that might have worked well in SQL Server, you should understand how Postgres works and change the approach to match Postgres best practices.
I have seen many migrations fail or deliver mediocre performance because of the assumption that the best practices for "System A" can be applied without any change to "System B". You need to migrate your mindset as well.
If this checks the columns of a temp table, then why not simply check the actual size of the column values using pg_column_size()?
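For example (a minimal sketch; my_temp_table is a placeholder and shortname is the column from the question):
-- pg_column_size() returns the number of bytes actually used to store each
-- value, so the maximum over all rows gives the real space requirement.
SELECT max(pg_column_size(shortname)) AS max_shortname_bytes
FROM my_temp_table;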
I have a table
T (A int, B int, C long, D varchar)
partitioned by each A and sub-partitioned by each B (i.e. list partitions with a single value each). A has cardinality of <10 and B has cardinality of <100. T has about 6 billion rows.
When I run the query
select distinct B from T where A = 1;
it prunes the top-level partitions (those where A != 1) but performs a table scan on all sub-partitions to find distinct values of B. I thought it would know, based on the partition design, that it would only have to check the partition constraint to determine the possible values of B given A, but alas, that is not the case.
There are no indexes on A or B, but there is a primary key on (C,D) at each partition, which seems immaterial, but figured I should mention it. I also have a BRIN index on C. Any idea why the Postgres query planner is not consulting the sub-partition constraints to avoid the table scan?
The reason is that nobody has implemented such an optimization in the query planner. I cannot say that surprises me, since it is a rather unusual query. Every such optimization built into the optimizer would mean that each query on a partitioned table that has a DISTINCT would need some extra query planning time, while only a few queries would profit. Apart from the expense of writing and maintaining the code, that would be a net loss for most users.
Maybe you could use a metadata query:
CREATE TABLE list (id bigint NOT NULL, p integer NOT NULL) PARTITION BY LIST (p);
CREATE TABLE list_42 PARTITION OF list FOR VALUES IN (42);
CREATE TABLE list_101 PARTITION OF list FOR VALUES IN (101);
SELECT regexp_replace(
         pg_get_expr(p.relpartbound, p.oid),
         '^FOR VALUES IN \((.*)\)$',
         '\1'
       )::integer
FROM pg_class AS p
   JOIN pg_inherits AS i ON p.oid = i.inhrelid
WHERE i.inhparent = 'list'::regclass;
regexp_replace
----------------
42
101
(2 rows)
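Applied to the question's two-level layout, the same idea can read the B values from the bounds of the sub-partitions under the A = 1 partition. The partition name t_a1 below is hypothetical:
-- t_a1 is assumed to be the partition of T for A = 1; each of its
-- sub-partitions declares exactly one B value in its bound.
SELECT regexp_replace(
         pg_get_expr(p.relpartbound, p.oid),
         '^FOR VALUES IN \((.*)\)$',
         '\1'
       )::integer AS b
FROM pg_class AS p
   JOIN pg_inherits AS i ON p.oid = i.inhrelid
WHERE i.inhparent = 't_a1'::regclass;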
I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C each hold exactly 20 bytes of data; C sometimes contains NULLs.
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while B and C were still running after more than 10 hours, at which point I aborted them. I ran each CREATE INDEX on its own, so no indexes were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the second index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
Then execute the following query:
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = '<Your table name here>';
Basically, the database has more work to do when creating an index when:
The number of distinct values is higher.
The correlation (i.e. how closely the physical storage order follows the value order) is close to 0.
I suspect you will see that field A differs from the other two fields in its number of distinct values and/or has a higher correlation.
Edit: Basically, creating an index = a FULL SCAN of the table, creating index entries as you go. With the stats you have shared, that means:
Column A: it was detected as unique.
A single scan is enough, as the DB knows 1 record = 1 index entry.
Columns B & C: they were detected as having very few distinct values + abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid, but in fact, as explained here, a small correlation means the indexes will probably not be used (an index is useful only when the matching entries are not scattered across all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order as SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B,C,A;
DROP TABLE Transactions_order;
The tricky part comes next: as records are inserted/updated/deleted, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
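One way to keep an eye on it (a sketch, using the table name from your CREATE INDEX statements):
-- Refresh the statistics after the reload, then check how closely the
-- physical row order still follows the value order (1 or -1 = fully ordered).
ANALYZE transactions;

SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'transactions';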
Solution 3:
Create partitions and enjoy partition pruning.
There has been quite a lot of work on partitioning in recent PostgreSQL releases; it could be worth having a look into it.
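A minimal sketch of what that could look like, assuming PostgreSQL 11 or later and that most of your lookups filter on B (the table and partition names are placeholders):
-- Hash-partition a copy of the table on B so equality lookups on B only
-- have to touch a single partition.
CREATE TABLE transactions_part (
    a bytea NOT NULL,
    b bytea NOT NULL,
    c bytea
) PARTITION BY HASH (b);

CREATE TABLE transactions_part_0 PARTITION OF transactions_part
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ...create the partitions for remainders 1 through 7 the same way,
-- then load the data:
-- INSERT INTO transactions_part SELECT * FROM transactions;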
I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables, most of which store a significant number of their columns as numeric codes; these numbers correspond to a lookup table with the real values. We are talking about 9,500 different values (e.g. '502 = yes' or '1413 = Graduate Student').
In any other situation I would just add a WHERE clause and show where they are equal, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine, but I have to manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this, AND some need DECODE statements with 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. I'm not sure whether that is possible for you, so I'll leave that part aside.
As mentioned in my comment, I would stay away from using DECODE as described in your post. I would start by writing it as ordinary joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (it may help the optimizer create a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license allows it, you can experiment with Materialized Query Tables (MQT). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option, if your lookup table is fairly static, is to cache the lookup table in the application: read the TEST_TABLE from the database and look up the descriptions in the application. A further improvement may be to add triggers that invalidate the cache when the lookup table is modified.
If you don't want to do all these joins, you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id, or NULL if it is not found.
With your roughly 10k id/text pairs and an index on the ID column, this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
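Hypothetical usage against the TEST_TABLE from the question, decoding several coded columns in one statement (ANOTHER_STATUS is a made-up second column for illustration):
-- Each LOOKUP call resolves one code to its text via the TEST.LOOKUP table.
SELECT TEST_ID,
       LOOKUP(TEST_STATUS)    AS TEST_STATUS,
       LOOKUP(ANOTHER_STATUS) AS ANOTHER_STATUS
FROM TEST_TABLE;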
I have a complicated table which has only 7 columns, but in production it will have many rows, say more than 100,000. So I executed RUNSTATS for two columns, one the PK and the other an FK:
RUNSTATS ON TABLE WEBSS.P0029_LOCATION WITH DISTRIBUTION ON COLUMNS (LOC_ID, OUTLET_ID);
After this, when I run
SELECT * FROM SYSCAT.COLDIST WHERE TABSCHEMA = 'WEBSS' AND TABNAME = 'P0029_LOCATION'
The result has 60 rows, 30 for each of the two columns. The TYPE column is either Q or F (quantile and frequency). But I need a little more input on this: on what basis are Q and F defined, and on what basis do we need to optimise? Please share your suggestions.
There are two types of column statistics in DB2: simple ones, where you just get the column cardinality and the number of nulls, and distribution stats such as you have collected above.
I have found that simple statistics are better for most applications, unless you do literal searches on highly skewed data.
If you have indexes defined on your PKs and FKs, you get simple stats with
RUNSTATS ON TABLE MYTABLE ON KEY COLUMNS
or
RUNSTATS ON TABLE MYTABLE ON ALL COLUMNS
The quantiles (Q) are histogram data; by default I think you get 20 histogram values per column. The F entries are the most frequent values in your column; by default you get 10 of them. You don't need distributions on a PK, as it's unique, and it's unlikely you need them on an FK either. Stick to the simple stats first.
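If you want to inspect what was collected, SYSCAT.COLDIST can be read directly (schema and table names taken from your RUNSTATS command):
-- TYPE = 'F': the most frequent values and how many rows have each of them.
-- TYPE = 'Q': quantile boundaries; VALCOUNT is the number of rows with a
--             value less than or equal to COLVALUE.
SELECT COLNAME, TYPE, SEQNO, COLVALUE, VALCOUNT
FROM SYSCAT.COLDIST
WHERE TABSCHEMA = 'WEBSS'
  AND TABNAME = 'P0029_LOCATION'
ORDER BY COLNAME, TYPE, SEQNO;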