I am looking through some table source code, and I'm not sure how indexes work. Here is the example I see.
CREATE INDEX INDEXNAME ON AWARD ( AWARD_ID, CUST_ID );
I don't understand what the parameter values mean. Is there a separate index for each individual column, or are the two columns combined into a single index?
It's an index that contains two fields. Indexes in general are used for Selection, Joining, Grouping and/or Ordering.
The key thing to realize is that a multi-column index is useful from left to right.
For selection, such an index would be quite useful if you had a where clause that looked like
WHERE AWARD_ID = 123 AND CUST_ID = 456
It would also be helpful for
WHERE AWARD_ID = 123
But probably not (directly) helpful for
WHERE CUST_ID = 456
since the leftmost column of the index (AWARD_ID) is not referenced.
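If you routinely search by CUST_ID alone, the usual fix is a second index with CUST_ID as the leading column; a minimal sketch (the index name is made up):
CREATE INDEX AWARD_CUST_IX ON AWARD ( CUST_ID );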
Joining works in a similar manner.
--index useful
FROM TBLA JOIN AWARD USING (AWARD_ID,CUST_ID)
or
FROM TBLA JOIN AWARD USING (AWARD_ID)
--index NOT useful
FROM TBLA JOIN AWARD USING (CUST_ID)
as do ordering and grouping.
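For example, the same left-to-right rule applies when the index is used to satisfy a sort or a grouping:
--index useful
SELECT AWARD_ID, CUST_ID, COUNT(*)
FROM AWARD
GROUP BY AWARD_ID, CUST_ID
ORDER BY AWARD_ID, CUST_ID;
--index NOT useful
SELECT AWARD_ID, CUST_ID
FROM AWARD
ORDER BY CUST_ID;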
If you happen to be using DB2 for IBM i, the following paper is an awesome resource:
http://www-304.ibm.com/partnerworld/wps/servlet/ContentHandler/stg_ast_sys_wp_db2_i_indexing_methods_strategies
If you're on DB2 LUW, the parts about bitmap indexes probably apply, but ignore the information about EVI indexes. Also look at the Indexes sub-sections under planning and performance in the DB2 LUW infocenter:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.admin.perf.doc/com.ibm.db2.luw.admin.perf.doc-gentopic8.html
I have a table with the following columns:
ID (VARCHAR)
CUSTOMER_ID (VARCHAR)
STATUS (VARCHAR) (4 different status possible)
other not relevant columns
I'm trying to find all the rows with a given customer_id and either of two statuses.
The query looks like:
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');
The table contains about 1 million rows. I added two separate indexes, one on customer_id and one on status. The query still takes about 1 second to run.
The explain plan is:
1. Gather
2. Seq Scan on my_table
Filter: (((status)::text = ANY ('{SUBMITTED,CANCELLED}'::text[])) AND ((customer_id)::text = '12345678'::text))
I ran 'ANALYZE my_table' after creating the indexes. What could I do to improve the performance of this quite simple query?
You need a compound (multi-column) index to help satisfy your query.
Guessing, it seems like the most selective column (the one with the most distinct values) is customer_id. status probably has only a few distinct values. So customer_id should go first in the index. Try this.
CREATE INDEX customer_id_status ON my_table (customer_id, status);
This creates a BTREE index. A useful mental model for such an index is an old-fashioned telephone book. It's sorted in order. You look up the first matching entry in the index, then scan it sequentially for the items you want.
You may also want to try running ANALYZE my_table; to update the statistics (about selectivity) used by the query planner to choose an appropriate index.
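Once the compound index is in place, EXPLAIN should show an index (or bitmap) scan instead of the sequential scan; a quick check (the exact plan depends on your data):
EXPLAIN
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');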
Pro tip Avoid SELECT * if you can. Instead name the columns you want. This can help performance a lot.
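For example, if you only need a couple of the columns from the question:
SELECT id, status
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');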
Pro tip Your question said some of your columns aren't relevant to query optimization. That's probably not true; index design is a weird art, and SELECT * makes it even less true.
I want to load a table larger than 1 TB from S3 into Redshift.
I cannot use DISTSTYLE ALL because it is a big table.
I cannot use DISTSTYLE EVEN because I want to use this table in joins, where EVEN distribution causes performance issues.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our redshift cluster has 20 nodes.
So I tried a distribution key on workday, but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How to avoid the skew in such cases?
How do we avoid skewing of the table when the row counts are uneven across the key's values (let's say hour1 has 1 million rows, hour2 has 1.5 million rows, hour3 has 2 million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either SORTKEY or COMPOUND SORTKEY. Sort Key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. Of the columns used in your queries, choose as the DISTKEY the one that causes the least amount of skew. A column with many distinct values, such as a timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
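A minimal sketch of one such combination, using the columns from the question (the table name is made up; pick sort columns your queries actually filter and join on):
CREATE TABLE my_big_table (
  id INTEGER,
  name VARCHAR(10),
  another_id INTEGER,
  workday INTEGER,
  workhour INTEGER,
  worktime_number INTEGER
)
DISTSTYLE EVEN
COMPOUND SORTKEY (workday, workhour);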
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Here is the architecture that I recommend
1) load to a staging table with dist even and sort by something that is sorted on your loaded s3 data - this means you will not have to vacuum the staging table
2) set up a production table with the sort / dist you need for your queries. after each copy from s3, load that new data into the production table and vacuum.
3) you may wish to have 2 mirror production tables and flip flop between them using a late binding view.
It's a bit complex to do this; you may need some professional help, and there may be specifics to your use case.
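For step 3, a rough sketch of the flip-flop using a late-binding view (all object names are hypothetical):
--queries always go through the view, never the tables
CREATE VIEW production AS SELECT * FROM production_a WITH NO SCHEMA BINDING;
--after loading and vacuuming production_b, repoint the view
CREATE OR REPLACE VIEW production AS SELECT * FROM production_b WITH NO SCHEMA BINDING;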
As of writing this (just after re:Invent 2018), Redshift has Automatic Distribution available, which is a good starting point.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in the answers posted earlier, if you don't like what automatic distribution is doing, try a few combinations by replicating the same table with different DISTKEYs. After the tables are created, run the admin utility from the git repo (preferably creating a view from the SQL script in the Redshift DB).
Also, if you have good clarity on your query usage patterns, you can use the following SQL to check how well the sort keys are performing.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t."database", t.table_id, t."schema", t."schema" || '.' || t."table" AS "table", t.size, nvl(s.num_qs, 0) num_qs
FROM svv_table_info t
LEFT JOIN (
    SELECT tbl, COUNT(DISTINCT query) num_qs
    FROM stl_scan s
    WHERE s.userid > 1
    AND s.perm_table_name NOT IN ('Internal Worktable', 'S3')
    GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 DESC;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
Of course, there is always room for improvement in the SQL above, depending on the specific stats you may want to look at or drill down into.
Hope this helps.
I was on MySQL and have now moved to Postgres. I have a table that gets up to 300,000 new records a day but also has many reads. I have 2 columns that I think would be ideal for indexes: latitudes and longitudes. I know that Postgres has different types of indexes, so my question is: which type of index would be best for a table that has many writes and reads? This is the query for the reads:
SELECT p.fullname, s.post, to_char(s.created_on, 'MON DD,YYYY'), last_reply, s.id,
       r.my_id, s.comments, s.city, s.state, p.reputation, s.profile_id
FROM profiles AS p
INNER JOIN streams AS s ON (s.profile_id = p.id)
LEFT JOIN reputation AS r ON (r.stream_id = s.id AND r.my_id = ?)
WHERE s.latitudes >= ? AND ? >= s.latitudes
  AND s.longitudes >= ? AND ? >= s.longitudes
ORDER BY s.last_reply DESC
LIMIT ?
As you can see, the 2 columns in the WHERE clause are latitudes and longitudes.
PostgreSQL has the point data type with many operators that have good support from the gist index. So if at all possible change your table definition to use a point rather than 2 floats.
Inserting point data is very easy, just use point(longitudes, latitudes) for the column, instead of putting the two values in separate columns. Same with getting data out: lnglat[0] is the longitude and lnglat[1] is the latitude.
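A quick sketch of both directions (assuming the lnglat point column has been added to streams):
--store longitude/latitude as a single point
INSERT INTO streams (lnglat) VALUES (point(-73.99, 40.73));
--read the components back out
SELECT lnglat[0] AS longitude, lnglat[1] AS latitude FROM streams;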
The index would be something like this:
CREATE INDEX idx_mytable_lnglat ON streams USING gist (lnglat);
(point_ops is the default GiST operator class for point, so it does not need to be spelled out.)
There is also the box data type, which would be great for grouping all your parameters; finding a point in a box is highly optimized in the gist index.
With a point in the table and a box to search on, your query reduces to this:
SELECT p.fullname, s.post, to_char(s.created_on, 'MON DD,YYYY'), last_reply, s.id,
r.my_id, s.comments, s.city, s.state, p.reputation, s.profile_id
FROM profiles AS p
JOIN streams AS s ON (s.profile_id = p.id)
LEFT JOIN reputation AS r ON r.stream_id = s.id AND r.my_id = ?
WHERE s.lnglat <@ box(point(?, ?), point(?, ?))
ORDER BY s.last_reply DESC
LIMIT ?;
The phrase s.lnglat <@ box(point(?, ?), point(?, ?)) means "the point in column lnglat is contained in the box built from the two corner points".
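For example, with literal values (the two arguments are opposite corners of the box):
SELECT point(-73.99, 40.73) <@ box(point(-74.05, 40.70), point(-73.90, 40.80)); --true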
If the latitude or longitude values are sortable (plain numbers are), you would probably want to use a B-tree index.
From the Postgres documentation page on indices:
B-trees can handle equality and range queries on data that can be sorted into some ordering. In particular, the PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison using one of the [greater than / lesser than-type operators]
You can read more about indices here.
Edit: Some of the G* indices (e.g. GiST) look like they might be of use if you need to index on both latitude and longitude, since they appear to allow multi-dimensional (e.g. 2D) indexing.
Edit2: In order to actually create the index, you'd want to do something along the lines of the following (although you may need to change the table name to suit your needs):
CREATE INDEX idx_lat ON s(latitudes);
Take note that B-tree indices are default so you don't need to specify the type.
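Equivalently, with the access method spelled out:
CREATE INDEX idx_lat ON s USING btree (latitudes);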
Read more about index creation here.
I have three tables in my app, call them tableA, tableB, and tableC. tableA has fields for tableB_id and tableC_id, with indexes on both. tableB has a field foo with an index, and tableC has a field bar with an index.
When I do the following query:
select *
from tableA
left outer join tableB on tableB.id = tableA.tableB_id
where lower(tableB.foo) = lower(my_input)
it is really slow (~1 second).
When I do the following query:
select *
from tableA
left outer join tableC on tableC.id = tableA.tableC_id
where lower(tableC.bar) = lower(my_input)
it is really fast (~20 ms).
From what I can tell, the tables are about the same size.
Any ideas as to the huge performance difference between the two queries?
UPDATES
Table sizes:
tableA: 2061392 rows
tableB: 175339 rows
tableC: 1888912 rows
postgresql-performance tag info
Postgres version - 9.3.5
The full text of the queries is above.
Explain plans - tableB tableC
Relevant info from tables:
tableA
tableB_id, integer, no modifiers, storage plain
"index_tableA_on_tableB_id" btree (tableB_id)
tableC_id, integer, no modifiers, storage plain,
"index_tableA_on_tableB_id" btree (tableC_id)
tableB
id, integer, not null default nextval('tableB_id_seq'::regclass), storage plain
"tableB_pkey" PRIMARY_KEY, btree (id)
foo, character varying(255), no modifiers, storage extended
"index_tableB_on_lower_foo_tableD" UNIQUE, btree (lower(foo::text), tableD_id)
tableD is a separate table that is otherwise irrelevant
tableC
id, integer, not null default nextval('tableC_id_seq'::regclass), storage plain
"tableC_pkey" PRIMARY_KEY, btree (id)
bar, character varying(255), no modifiers, storage extended
"index_tableC_on_tableB_id_and_bar" UNIQUE, btree (tableB_id, bar)
"index_tableC_on_lower_bar" btree (lower(bar::text))
Hardware:
OS X 10.10.2
CPU: 1.4 GHz Intel Core i5
Memory: 8 GB 1600 MHz DDR3
Graphics: Intel HD Graphics 5000 1536 MB
Solution
Looks like running vacuum and then analyze on all three tables fixed the issue. After running the commands, the slow query started using "index_tableB_on_lower_foo_tableD".
The other thing is that you query your indexed columns wrapped in lower(), so a plain index on the bare column cannot be used for that predicate.
If you will always query the column as lower(), then your column should be indexed as lower(column_name), as in:
create index idx_1 on tableb(lower(foo));
Also, have you looked at the execution plan? This will answer all your questions if you can see how it is querying the tables.
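In Postgres that is as simple as prefixing the query; EXPLAIN shows the estimated plan, and EXPLAIN ANALYZE actually runs the query and reports real row counts and timings (substitute your own input value):
EXPLAIN ANALYZE
select *
from tableA
left outer join tableB on tableB.id = tableA.tableB_id
where lower(tableB.foo) = lower('my_input');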
Honestly, there are many factors to this. The best solution is to study up on INDEXES, specifically in Postgres, so you can see how they work. It is a bit of a holistic subject; you can't really answer all your problems with a minimal understanding of how they work.
For instance, Postgres has an initial "let's look at these tables and see how we should query them" step before the query runs. It looks over all the tables, how big each of them is, what indexes exist, etc., and then figures out how the query should run. THEN it executes it. Oftentimes, this is what is wrong: the engine incorrectly determines how to execute it.
A lot of the calculations of this are done off of the summarized table statistics. You can reset the summarized table statistics for any table by doing:
vacuum [table_name];
(this helps to prevent bloating from dead rows)
and then:
analyze [table_name];
I haven't always seen this work, but often times it helps.
Anyway, so the best bet is to:
a) Study up on Postgres indexes (a SIMPLE write up, not something ridiculously complex)
b) Study up the execution plan of the query
c) Using your understanding of Postgres indexes and how the query plan is executing, you cannot help but solve the exact problem.
For starters, your LEFT JOIN is counteracted by the WHERE predicate on the left-joined table and is forced to act like a plain [INNER] JOIN. Replace with:
SELECT *
FROM tableA a
JOIN tableB b ON b.id = a.tableB_id
WHERE lower(b.foo) = lower(my_input);
Or, if you actually want the LEFT JOIN to include all rows from tableA:
SELECT *
FROM tableA a
LEFT JOIN tableB b ON b.id = a.tableB_id
AND lower(b.foo) = lower(my_input);
I think you want the first one.
An index on (lower(foo::text)) like you posted is syntactically invalid. You better post the verbatim output from \d tbl in psql like I commented repeatedly. A shorthand syntax for a cast (foo::text) in an index definition needs more parentheses, or use the standard syntax: cast(foo AS text):
Create index on first 3 characters (area code) of phone field?
But that's also unnecessary. You can just use the data type (character varying(255)) of foo. Of course, the data type character varying(255) rarely makes sense in Postgres to begin with. The odd limitation to 255 characters is derived from limitations in other RDBMS which do not apply in Postgres. Details:
Refactor foreign key to fields
Be that as it may. The perfect index for this kind of query would be a multicolumn index on B - if (and only if) you get index-only scans out of this:
CREATE INDEX "tableB_lower_foo_id" ON tableB (lower(foo), id);
You can then drop the mostly superseded index "index_tableB_on_lower_foo". Same for tableC.
The rest is covered by the (more important!) indices in table A on tableB_id and tableC_id.
If there are multiple rows in tableA per tableB_id / tableC_id, then either one of these competing commands can swing the performance to favor the respective query by physically clustering related rows together:
CLUSTER tableA USING "index_tableA_on_tableB_id";
CLUSTER tableA USING "index_tableA_on_tableC_id";
You can't have both. It's either B or C. CLUSTER also does everything a VACUUM FULL would do. But be sure to read the details first:
Optimize Postgres timestamp query range
And don't use mixed case identifiers, sometimes quoted, sometimes not. This is very confusing and is bound to lead to errors. Use legal, lower-case identifiers exclusively - then it doesn't matter if you double-quote them or not.
SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, tblJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When I try to execute the query, it results in the following error:
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way seems to be to truncate (with substring()) the "tblJobAnnouncement.JobClassDesc" column in the query, which has a column size of around 8000.
Is there any workaround so that I need not truncate the values? Or can this query be optimised? Any setting in SQL Server 2000?
The [non-obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan first produces the list of qualifying rows (the records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), retaining only the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan, but the reliance on an index implicitly implies the "sortability" of the underlying columns. In yet other cases, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability to compare two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to assert that, without DISTINCT, the query completes OK (if only including some duplicates). Once this fact is confirmed, and if effectively the query could produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
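One such construct is an EXISTS subquery in place of a join or IN list; a sketch (whether it removes the need for DISTINCT depends on where the duplicates actually come from):
SELECT JR.JobReqId, JR.JobStatusId
FROM tblJobReq AS JR
WHERE EXISTS (SELECT 1 FROM tblJobClass AS JC
              WHERE JC.JobClassId = JR.JobClassId
                AND JC.Title LIKE '%Family Therapist%')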
An unrelated hint is to use table aliases: short strings that save you from repeating these long table names. For example (only did a few tables, but you get the idea...):
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON JR.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0, 'dbo.TableName')
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB --, IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000