incremental update of a postgres index - postgresql

I have inserted a lot of data (more than 2 million documents) into a table and created a full text search index using GIN, and it works great. I can query the database and retrieve the appropriate documents rapidly.
Regularly, I collect new data that I want to insert into the database. What I would like to do is update my index with the new data only, but I have failed so far. I don't want to drop the index and recreate it, because that takes ages. I basically would like to do an incremental update of the index. Updating the index on the fly while data is being inserted works, but it is very, very slow. I read that creating an index after the data is inserted is faster (true), so I assumed that updating an index with only the new data should be possible, but I haven't managed to do it so far.
I use PostgreSQL 12.
Can anybody help me, please?

There is no way to suspend adding values to the index while you load data.
But GIN indexes already have a feature to optimize that: the GIN fast update technique.
Set the gin_pending_list_limit storage parameter on the index to a high value before the bulk load. Once you are done with the bulk load, VACUUM the table to integrate the pending list into the main index.
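For example, assuming a hypothetical index named documents_fts_idx on a table named documents (the limit is in kB and the value is only an illustration), the sequence could look roughly like this:

ALTER INDEX documents_fts_idx SET (fastupdate = on);
ALTER INDEX documents_fts_idx SET (gin_pending_list_limit = 1048576);
-- ... bulk INSERT / COPY the new documents here ...
VACUUM documents;  -- merges the pending list into the main GIN structure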
An alternative approach is to use partitioning and load one partition at a time. Then create the index on the new partition and attach it to the partitioned table.
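A rough sketch of that approach, assuming the table is range-partitioned by a load_date column (all names, paths and dates below are made up):

CREATE TABLE documents_2024_05 (LIKE documents INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
-- load only the new batch into the standalone table
COPY documents_2024_05 FROM '/path/to/new_batch.csv' WITH (FORMAT csv);
-- index just the new data, then attach it to the partitioned table
CREATE INDEX ON documents_2024_05 USING gin (to_tsvector('english', body));
ALTER TABLE documents ATTACH PARTITION documents_2024_05
    FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');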

Related

Postgres Upsert vs Truncate and Insert

I have a stream of data that I can replay any time to reload data into a Postgres table. Let's say I have millions of rows in my table and I add a new column. Now I can replay that stream of data to map a key in the data to the column name that I have just added.
The two options I have are:
1) Truncate and then Insert
2) Upsert
Which would be a better option in terms of performance?
The way PostgreSQL does multiversioning, every update creates a new row version. The old row version will have to be reclaimed later.
This means extra work and tables with a lot of empty space in them.
On the other hand, TRUNCATE just throws away the old table, which is very fast.
You can gain extra performance by using COPY instead of INSERT to load larger amounts of data.
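A minimal sketch of the truncate-and-reload variant, with made-up table and file names:

BEGIN;
TRUNCATE my_table;
-- COPY is much faster than row-by-row INSERTs for bulk loads
COPY my_table FROM '/path/to/replay.csv' WITH (FORMAT csv);
COMMIT;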

Redshift query a daily-generated table

I am looking for a way to create a Redshift query that will retrieve data from a table that is generated daily. Tables in our cluster are of the form:
event_table_2016_06_14
event_table_2016_06_13
.. and so on.
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
SELECT * FROM concat('event_table_', to_char(getdate(),'YYYY_MM_DD'))
Any suggestions on how this can be performed are greatly appreciated!
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation).
Redshift does not support that. But you most likely won't need it.
Try the following (expanding on the answer from #ketan):
Create your main table with an appropriate (for joins) DIST key, a COMPOUND or simple SORT KEY on the timestamp column, and proper compression on the columns.
Daily, create a temp table (use CREATE TABLE ... LIKE - this preserves the DIST/SORT keys), load it with the daily data, and VACUUM SORT it.
Copy the sorted temp table into the main table using ALTER TABLE APPEND - this copies the data already sorted and reduces the need to VACUUM the main table. You may still need a VACUUM SORT after that.
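A sketch of the daily load and append steps, with made-up table names and S3 location (adjust the COPY options and credentials to your setup):

CREATE TABLE event_stage (LIKE event_table);   -- inherits DIST and SORT keys
COPY event_stage FROM 's3://my-bucket/2016-06-14/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role' FORMAT AS CSV;
VACUUM SORT ONLY event_stage;
ALTER TABLE event_table APPEND FROM event_stage;   -- moves data blocks, no rewrite
DROP TABLE event_stage;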
After that, query your main table normally, probably giving it a range on the timestamp. Redshift is optimised for these scenarios, and 99% of the time you don't need to optimise table scans yourself - even on tables with billions of rows, scans take milliseconds to a few seconds. You may need to optimise elsewhere, but that's the second step.
To get insight into the performance of scans, use the STL_QUERY system table to find your query ID, and then use the STL_SCAN (or SVL_QUERY_SUMMARY) table to see how fast the scan was.
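For example (the query ID 12345 below is just a placeholder):

SELECT query, starttime, substring(querytxt, 1, 60)
FROM stl_query
WHERE querytxt LIKE '%event_table%'
ORDER BY starttime DESC LIMIT 10;

SELECT * FROM svl_query_summary WHERE query = 12345 ORDER BY seg, step;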
Your example is actually the main use case for ALTER TABLE APPEND.
I am assuming that you are creating a new table every day.
What you can do is:
1) Create a view on top of the event_table_* tables. Query your data using this view.
2) Whenever you create or drop a table, update the view.
If you want, you can avoid #2: instead of creating a new table every day, create empty tables for the next 1-2 years, so there is no need to update the view every day. However, do remember that there is an upper limit of 9,900 tables in Redshift.
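The view in 1) would just UNION ALL the daily tables together, for example (names are illustrative):

CREATE OR REPLACE VIEW event_table_all AS
SELECT * FROM event_table_2016_06_13
UNION ALL
SELECT * FROM event_table_2016_06_14;
-- add another UNION ALL branch (re-create the view) whenever a new daily table appears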
Edit: If you always need to query today's table (instead of all tables, as I assumed originally), I don't think you can do that without updating your view.
However, you can modify your design to have just one table, with date as the sort key. Then, whenever your table is queried with some date, all disk blocks that don't contain that date will be skipped. That'll be as efficient as having time-series tables.
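A sketch of that single-table design (column names and the distribution style are made up):

CREATE TABLE event_table (
    event_date date NOT NULL,
    event_type varchar(64),
    payload varchar(max)
)
DISTSTYLE EVEN
SORTKEY (event_date);

-- a query filtered on the sort key only reads the blocks for that date
SELECT * FROM event_table WHERE event_date = '2016-06-14';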

PostgreSQL - reindex when add new index

I have a table with 100k records without indexes. I created a new index on a column that is used for a left join.
Do I need to reindex my table?
Creating the index took only a few milliseconds, so I am guessing the query cannot use this index (no data) until I reindex my table (if I had other indexes, I would reindex only this one index - I read the manual).
I can't find any information about when a new index is populated with data. Is this done automatically? When?
Once CREATE INDEX has finished, the index is ready to be used. There is no need to run REINDEX after that.
From the REINDEX documentation page:
REINDEX is similar to a drop and recreate of the index in that the index contents are rebuilt from scratch. However, the locking considerations are rather different. REINDEX locks out writes but not reads of the index's parent table.
That means REINDEX behaves similarly to a CREATE after a DROP.
And from the CREATE INDEX documentation page:
Creating an index can interfere with regular operation of a database. Normally PostgreSQL locks the table to be indexed against writes and performs the entire index build with a single scan of the table. Other transactions can still read the table, but if they try to insert, update, or delete rows in the table they will block until the index build is finished.
I think this makes it clear that creating an index also builds it over the table's existing data.
Whether or not a specific query uses the index depends on many different things, though. If your query doesn't use the index, you need to post the query, the table definitions (e.g. as a create table statement), the index you have defined, and the output of explain (analyze, verbose) for your query.
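For example, with a made-up table and query to show the pattern (whether the index is actually used depends on your data and the plan the optimizer chooses):

CREATE INDEX orders_customer_id_idx ON orders (customer_id);

EXPLAIN (ANALYZE, VERBOSE)
SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE c.id = 42;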

Is the speed of a PostgreSQL SELECT adversely affected by too many indexes on the table?

I have read that having a lot of indexes on a database can seriously hurt performance, but I can't find anything about it in the PostgreSQL docs.
I have a very big table with something like 100 columns and a billion rows, and I often have to do a lot of searches on a lot of different fields.
Will the performance of the PostgreSQL table drop if I add a lot of indexes (maybe 10 single-column unique indexes and 5 or 7 three-column indexes)?
EDIT: By performance drop I mean the performance of fetching rows (SELECT); the database will be updated once a month, so update and insert times are not an issue.
Indexes are maintained whenever the content of the table is modified (i.e. INSERT, UPDATE, DELETE).
PostgreSQL's query planner decides when to use an index and when it's not needed and a sequential scan is more efficient.
So having too many indexes will hurt modification performance, not fetching performance.
The indexes will have to be updated at each insert and update involving those columns.
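If you want to check which of your indexes are actually being used by SELECTs, the statistics views can help; for example (the table name is made up):

SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'my_big_table'
ORDER BY idx_scan;
-- indexes with idx_scan = 0 have not been used since the statistics were last reset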
I have some charts about that on my site: http://use-the-index-luke.com/sql/dml
An index is pure redundancy. It contains only data that is also stored in the table. During write operations, the database must keep those redundancies consistent. Specifically, it means that insert, delete and update not only affect the table but also the indexes that hold a copy of the affected data.
The chapter titles suggest the impact that indexes can have:
Insert — cannot take direct benefit from indexes
Delete — uses indexes for the where clause
Update — does not affect all indexes of the table

What is the command for Index optimization and update statistics for Oracle 10g and 11g?

I am loading a large number of rows into a table from a CSV data file. For every 10,000 records, I want to update the indexes on the table for optimization (update statistics). Can anybody tell me what command I can use? Also, what is the equivalent of SQL Server's "UPDATE STATISTICS" in Oracle? Does "update statistics" mean index optimization or gathering statistics? I am using Oracle 10g and 11g. Thanks in advance.
Index optimization is a tricky question. You can COALESCE an index to eliminate adjacent empty blocks, and you can REBUILD an index to completely trash and recreate it. In my opinion, what you may wish to do for the duration of your data load is make the indexes UNUSABLE, then, when you're done, REBUILD them.
ALTER INDEX my_table_idx01 UNUSABLE;
-- run loader process
ALTER INDEX my_table_idx01 REBUILD;
You only want to gather statistics once when you're done, and that's done with a call to DBMS_STATS, like so:
EXEC DBMS_STATS.GATHER_TABLE_STATS ('my_schema', 'my_table');
I would recommend taking a different approach. I would drop the index(es), load the data, and then recreate the index(es). When you recreate it, Oracle will build a well-balanced index over the data you just loaded. Two things are accomplished here: the records will load faster, and the index will be rebuilt with a properly balanced tree. (Note: be careful here - if the table is really big, you may need to declare a temporary tablespace for the build to work in.)
drop index my_index;
-- uber awesome loading process
create index my_index on my_table(my_col1, my_col2);