I have a 3-part composite key (Int, Int, Int) on a large table.
Insert speed degrades due to fragmentation.
PK1 does not fragment (inserts are in order and are never revised),
but PK2 and PK3 fragment badly and quickly.
What strategy should I use for index maintenance?
Is there a way to rebuild the index with something like:
PK1 fill factor 100
PK2 fill factor 10
PK3 fill factor 10
No, it's ONE index; you cannot have different fill factors for the columns of a single index. The index structure is made up of entries of (PK1, PK2, PK3), and it is this combined tuple that is stored on the pages. You can only set the fill factor for the index/page as a whole, not for individual parts of a compound index.
My typical approach would be to use something like 70% or 80% on an index I suspect of fragmentation, and then just observe. See how fast and how badly it fragments. If it's unbearable later in the day - lower the fill factor even more. Typically, with a 70-80% fill factor, you should be fine during the day, and if you rebuild those critical indexes every night, your system should work fine.
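For reference, the fill factor is set once for the whole index when it is rebuilt, so a nightly job is just one REBUILD per index. A minimal sketch, using placeholder index and table names (not taken from the question):

-- One fill factor for the entire composite index; it cannot vary per key column
ALTER INDEX PK_BigTable ON dbo.BigTable
REBUILD WITH (FILLFACTOR = 80);

-- Check during the day how quickly fragmentation creeps back
SELECT index_id, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.BigTable'), NULL, NULL, 'LIMITED');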
I need to partially index a column when a single condition is met (e.g. some_column = 'some_value'). I am worried about the customer impact of creating this new partial index and locking the table, and I am wondering how long that will take. In the databases where I am worried about the impact, there will be no records that meet the condition. Does this mean the overhead, and the time the table is locked, would be drastically less than if there were records to index at the time of index creation? The column in the WHERE condition is indexed.
It will not use the index on the column in the WHERE to speed up creation of the empty partial index. It will still scan the full table, at however long it takes to do that. Not needing to sort any tuples or generate any index leaf blocks will speed it up, but probably not 'drastically'.
If you are afraid it will hold the lock too long, you can create the index CONCURRENTLY. This will take longer to do, but will hold a weaker lock while it does it. It will still need a strong lock at the beginning and at the end, but it will only be held momentarily.
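A minimal sketch of that, using the condition from the question and an illustrative table name (the real table name isn't given):

-- Builds the partial index without holding the strong lock for the whole build;
-- writers are only blocked briefly at the start and at the end.
CREATE INDEX CONCURRENTLY idx_mytable_some_value
    ON mytable (some_column)
    WHERE some_column = 'some_value';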
We have a Postgres 11.2 database which stores time-series of values against a composite key. Given 1 or a number of keys, the query tries to find the latest value(s) in each time-series given a time constraint.
We suffer query timeouts when the data is not cached, because it seems to have to walk a huge number of pages in order to find the data.
Here is the relevant section in the explain. We are getting the data for a single time-series (with 367 values in this example):
-> Index Scan using quotes_idx on quotes q (cost=0.58..8.61 rows=1 width=74) (actual time=0.011..0.283 rows=367 loops=1)
Index Cond: ((client_id = c.id) AND (quote_detail_id = qd.id) AND (effective_at <= '2019-09-26 00:59:59+01'::timestamp with time zone) AND (effective_at >= '0001-12-31 23:58:45-00:01:15 BC'::timestamp with time zone))
Buffers: shared hit=374
This is the definition of the index in question:
CREATE UNIQUE INDEX quotes_idx ON quotes.quotes USING btree (client_id, quote_detail_id, effective_at);
Where the columns are 2x int4 and a timestamptz, respectively.
Assuming I'm reading the output correctly, why is the engine walking 374 pages (~3 MB, given our 8 kB page size) in order to return ~26 kB of data (367 rows of width 74 bytes)?
When we scale up the number of keys (say, 500) the engine ends up walking over 150k pages (over 1GB), which when not cached, takes a significant time.
Note that the average row size in the underlying table is 82 bytes (over 11 columns), and the table contains around 700 million rows.
Thanks in advance for any insights!
The 367 rows found in your index scan are probably stored in more than 300 table blocks (that is not surprising in a large table). So PostgreSQL has to access all these blocks to come up with a result.
This would perform much better if the rows were all concentrated in a few table blocks, in other words, if the logical ordering of the index corresponded to the physical order of the rows in the table. In PostgreSQL terms, a high correlation would be beneficial.
You can force PostgreSQL to rewrite the entire table in the correct order with
CLUSTER quotes USING quotes_idx;
Then your query should become much faster.
There are some disadvantages though:
While CLUSTER is running, the table is not accessible. This usually means down time.
Right after CLUSTER, performance will be good, but PostgreSQL does not maintain the ordering. Subsequent data modifications will reduce the correlation.
To keep the query performing well, you'll have to schedule CLUSTER regularly; the query sketched below can help you decide how often.
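The correlation statistic PostgreSQL already collects shows how far the physical ordering has drifted since the last CLUSTER (a small sketch, using the schema and table names from the index definition in the question):

-- correlation near 1.0: physical row order still matches the column order
-- correlation drifting towards 0: the benefit of the last CLUSTER has worn off
SELECT attname, correlation
FROM pg_stats
WHERE schemaname = 'quotes' AND tablename = 'quotes';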
Reading 374 blocks to obtain 367 rows is not unexpected. CLUSTERing the data is one way to address that, as already mentioned. Another possibility is to add some more columns to the index column list (by creating a new index and dropping the old one), so that the query can be satisfied with an index-only scan.
This requires no downtime if the index is created concurrently. You do have to keep the table well-vacuumed, which can be tricky, as the autovacuum parameters were really not designed with index-only scans in mind. It requires no maintenance other than the vacuuming, so I would prefer this method if the list (and size) of columns you need to add to the index is small.
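A sketch of that approach on PostgreSQL 11, assuming quotes_idx is a plain unique index (not one backing a constraint) and using a made-up quoted_value as the extra column the query needs:

-- Build the covering replacement without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY quotes_idx_new
    ON quotes.quotes (client_id, quote_detail_id, effective_at)
    INCLUDE (quoted_value);

-- Swap it in once the new index is valid
DROP INDEX CONCURRENTLY quotes.quotes_idx;
ALTER INDEX quotes.quotes_idx_new RENAME TO quotes_idx;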
To give you a bit of background: I have a process which does a large, complex calculation that takes a while to complete. It runs on a timer. After some investigation I realised that what is causing the slowness isn't the actual calculation but the built-in q function union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns; the first column is a symbol. Table A is actually the compound primary key of another table. (Also, how do you copy directly from the console?)
n:5000000
big:([]sym:n?`4;val:n?100)        / column names added; table literals need named columns here
small:([]sym:500?`4;val:500?100)
\ts big union small
I tried keying both columns and upserting, join and then distinct, and "big,small where not small in big", but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table, it has to be keyed and the upsert operator should be used. For example:
n:5000000
//big ids are unique numbers from 0 to 4999999
//table is keyed with 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//small ids are unique numbers: 250 from the 0-4999999 interval and 250 from the 5000000-9999999 interval
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable, it is efficient to upsert into it as
`big upsert small
If big is local:
big: big upsert small
As a result, big will have 5,000,250 rows, because 250 of the keys (id column) in small already exist in big.
This may not be relevant, but just a quick thought: if your big table has a column of type `sym, and that column does not really show up much throughout your program, why not cast it to string or some other type? If you are doing this update process every single day, then as the data gets packed into your partitioned HDB, whenever new data is added the kdb+ process has to reassign/rewrite its sym file, and I believe this is the part that actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting the schema for the table so as to minimise the amount of rehashing (not sure if that is the right term!) of your sym file, or, as the other answer mentioned, trying to assign an attribute to your table; that may reduce the time too.
In my case, I have to load a huge amount of data from one table to another (Teradata to SQL Server). Using JdbcCursorItemReader, it takes about 30 minutes on average to load 200,000 records, since the table has 40 columns. So I am planning to use the partitioning technique.
Below are the challenges
The table has a composite primary key (2 columns).
And one of those columns has negative values.
Is it possible to use the column-partitioning technique in this case?
I see the column-partitioning technique uses a single primary key column and finds its max and min values. In my case, with a composite primary key, even if I figure out some way to determine the max, min, and grid size, will the framework support partitioning on a composite primary key?
A couple things to note here:
JdbcCursorItemReader is not thread safe so it typically isn't used in partitioning scenarios. Instead the JdbcPagingItemReader is used.
Your logic to partition the data is purely up to you. While doing it via values in a column is useful, it doesn't apply to all use cases (like this one). In this specific use case, you may want to partition by ROW_NUMBER() or something similar, or add a column to partition by (a sketch follows below).
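As a sketch of that second point (the table and key column names below are made up, not from the question), each partition's reader could select a contiguous range of row numbers computed over the composite key, with the range bounds supplied by the partitioner:

-- Reader query for one partition; :minRow and :maxRow come from the partitioner
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY key_col1, key_col2) AS rn
    FROM   source_table t
) numbered
WHERE rn BETWEEN :minRow AND :maxRow;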
I am trying to create an index on one of my tables with an accurate label. Here is how I am trying it, expecting "sysname" to resolve to the column or table name. But after I run this command and view it in Object Explorer, it is listed as
"[<Name of Missing Index, sysname + '_prod',>]".
How do you define index names in a better, more descriptive fashion? (I am trying to add the suffix "_prod" to the index name, since an index with that name already exists.)
USE [AMDMetrics]
GO
CREATE NONCLUSTERED INDEX
[<Name of Missing Index, sysname + '_prod',>]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE],
[QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO], [QXP_BASE_EXCEPTION], [QXP_CATEGORY],
[QXP_OCCURENCE_DATE], [QXP_COORD_ID], [QXP_SHORT_DESC], [QXP_ROOT_CAUSE],
[QXP_DESCRIPTION], [QXP_QEI_ID], [PXP_LOT_NUMBER], [CXP_ID], [CXP_AWARE_DATE],
[QXP_XSV_CODE], [QXP_COORD_NAME], [PXP_PRODUCT_CODE], [PXP_PRODUCT_NAME],
[QXP_ORU_NAME], [QXP_RESOLUTION_DESC], [QXP_CLOSED_DATE], [CXP_CLIENT_CODE],
[CXP_CLIENT_NAME])
I'm not 100% sure what you are trying to do, but it seems like you are trying to find a way to properly name your index (or find a good naming convention). Conventions are best when they are easy to follow and make sense to people without having to be explained. A lot of different conventions fit this description, but the one that is most common is this:
Index Type                          Prefix   Complete Index name
-------------------------------------------------------------------
Index (not unique, non clustered)   IDX_     IDX_<name>_<column>
Index (unique, non clustered)       UDX_     UDX_<name>_<column>
Index (not unique, clustered)       CIX_     CIX_<name>_<column>
Index (unique, clustered)           CUX_     CUX_<name>_<column>
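Applied to the index in your script, that could look something like the following (the name is just a suggestion following the IDX_ pattern with your '_prod' suffix, and the INCLUDE list is shortened here for readability):

CREATE NONCLUSTERED INDEX [IDX_V_CXP_CUSTOMER_PXP_QXP_UDF_STRING_8_prod]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE], [QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO], [QXP_BASE_EXCEPTION]);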
Although on a different note, I have to question why you have so many columns in your INCLUDE list. Without knowing the size of those columns, there are some drawbacks to adding so many:
Avoid adding unnecessary columns. Adding too many index columns,
key or nonkey, can have the following performance implications:
- Fewer index rows will fit on a page. This could create I/O increases
and reduced cache efficiency.
- More disk space will be required to store the index. In particular,
adding varchar(max), nvarchar(max), varbinary(max), or xml data types
as nonkey index columns may significantly increase disk space requirements.
This is because the column values are copied into the index leaf level.
Therefore, they reside in both the index and the base table.
- Index maintenance may increase the time that it takes to perform modifications,
inserts, updates, or deletes, to the underlying table or indexed view.
You will have to determine whether the gains in query performance outweigh
the affect to performance during data modification and in additional disk
space requirements.
From here: http://msdn.microsoft.com/en-us/library/ms190806.aspx