Update and insert performance with partial indexes - postgresql

I have several different queries that fetch data from a large table (about 100-200M rows). Because I know each query in advance, I've created partial indexes on the table with predicates that match each query.
For example, the table is similar to this:
CREATE TABLE public.contacts (
id int8 NOT NULL DEFAULT ssng_generate_id(8::bigint),
created timestamp NOT NULL DEFAULT timezone('UTC'::text, now()),
contact_pool_id int8 NOT NULL,
project_id int8 NOT NULL,
state_id int4 NOT NULL DEFAULT 10,
order_x int4 NOT NULL,
next_attempt_date timestamp NULL,
CONSTRAINT contacts_pkey PRIMARY KEY (id)
);
And there are two types of query:
SELECT * FROM contacts WHERE contact_pool_id = X AND state_id = 10 ORDER BY order_x LIMIT 1;
and
SELECT * FROM contacts WHERE contact_pool_id = X AND state_id = 20 AND next_attempt_date <= NOW() ORDER BY next_attempt_date LIMIT 1;
For those queries I've created partial indexes:
For state_id = 10 (new contacts)
CREATE INDEX ix_contacts_cpid_orderx_id_for_new ON contacts USING btree (contact_pool_id, order_x, id) WHERE state_id = 10;
For state_id = 20 (available contacts)
CREATE INDEX ix_contacts_cpid_nextattepmdate_id_for_available ON contacts USING btree (contact_pool_id, next_attempt_date, id) WHERE state_id = 20;
For me, those partial indexes are faster than a single index.
What about update and insert performance? If I change a row with state_id = 20, will that affect only index 2 (for available contacts), or will both indexes be affected?

Partial indexes whose predicate the tuple does not satisfy are not updated.
If PostgreSQL can do a HOT update (none of the changed columns is used in any index, and there is room on the same page for the new tuple), then even the relevant index does not need to be updated.

Yes, with a partial index you only pay the overhead of modifying the index for rows that meet the WHERE condition, so a write to one row modifies at most one of the two indexes (unless you change state_id from 10 to 20 or vice versa).
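
One hedged way to see the membership rule is to count index entries before and after an update. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. That is an assumption made only so the example runs without a server: SQLite also supports partial indexes, but its storage and update internals differ and it has no HOT updates. The table is a trimmed version of contacts, the index names ix_new and ix_available are shortened for the example, and the entry counts are read from SQLite's sqlite_stat1 table after ANALYZE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    id                INTEGER PRIMARY KEY,
    contact_pool_id   INTEGER NOT NULL,
    state_id          INTEGER NOT NULL DEFAULT 10,
    order_x           INTEGER NOT NULL,
    next_attempt_date TEXT
);
-- Shortened, hypothetical index names; same column lists and predicates as in the question.
CREATE INDEX ix_new ON contacts (contact_pool_id, order_x, id)
    WHERE state_id = 10;
CREATE INDEX ix_available ON contacts (contact_pool_id, next_attempt_date, id)
    WHERE state_id = 20;
""")
conn.executemany(
    "INSERT INTO contacts (id, contact_pool_id, state_id, order_x, next_attempt_date)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 10, 1, None),
        (2, 1, 10, 2, None),
        (3, 1, 10, 3, None),
        (4, 1, 20, 4, "2024-01-01 00:00:00"),
        (5, 1, 20, 5, "2024-01-02 00:00:00"),
    ],
)


def index_entry_counts(conn):
    """Return {index_name: entry_count} from SQLite's ANALYZE statistics."""
    conn.execute("ANALYZE")
    rows = conn.execute(
        "SELECT idx, stat FROM sqlite_stat1"
        " WHERE tbl = 'contacts' AND idx IS NOT NULL"
    )
    # The first integer in stat is the number of rows stored in that index.
    return {idx: int(stat.split()[0]) for idx, stat in rows}


before = index_entry_counts(conn)

# Change an indexed column of a state_id = 20 row: it stays in ix_available only.
conn.execute(
    "UPDATE contacts SET next_attempt_date = '2024-02-01 00:00:00' WHERE id = 4"
)
after_same_state = index_entry_counts(conn)

# Move a row from state 10 to state 20: it leaves ix_new and enters ix_available.
conn.execute("UPDATE contacts SET state_id = 20 WHERE id = 1")
after_state_change = index_entry_counts(conn)

print(before, after_same_state, after_state_change)
```

In this sketch the counts stay the same after the first update and shift by one entry in each index after the second, which is the "at most one index unless state_id changes" rule seen from the membership side. It says nothing about PostgreSQL's actual write cost, which also depends on whether the update is HOT.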

Related

Speed up heavy UPDATE..FROM..WHERE PostgreSQL query

I have 2 big tables
CREATE TABLE "public"."linkages" (
"supplierid" integer NOT NULL,
"articlenumber" character varying(32) NOT NULL,
"article_id" integer,
"vehicle_id" integer
);
CREATE INDEX "__linkages_datasupplierarticlenumber" ON "public"."__linkages" USING btree ("datasupplierarticlenumber");
CREATE INDEX "__linkages_supplierid" ON "public"."__linkages" USING btree ("supplierid");
having 215 000 000 records, and
CREATE TABLE "public"."article" (
"id" integer DEFAULT nextval('tecdoc_article_id_seq') NOT NULL,
"part_number" character varying(32),
"supplier_id" integer,
CONSTRAINT "tecdoc_article_part_number_supplier_id" UNIQUE ("part_number", "supplier_id")
) WITH (oids = false);
having 5 500 000 records.
I need to update linkages.article_id according to article.part_number and article.supplier_id, like this:
UPDATE linkages
SET article_id = article.id
FROM
article
WHERE
linkages.supplierid = article.supplier_id AND
linkages.articlenumber = article.part_number;
But it is too heavy. I tried it, but it ran for a day with no result, so I terminated it.
I need to do this update only once to normalize my datatable structure for using foreign keys in Django ORM. How can I resolve this issue?
Thanks a lot!

Efficient way to reconstruct base table from changes

I have a table products (with IDs, ~15k records) and another table price_changes (~88m records) recording a change in the price for a given productID at a given change date.
I'm now interested in the price of each product at given points in time (say every 2 hours for a year, so ~4300 points, resulting in ~64m data points of interest altogether). While it's straightforward to determine the price of a given product at a given time, determining all 64m data points is quite time-consuming.
My approach is to pre-populate a new target table fullprices with the data points of interest:
insert into fullprices(obsdate,productID)
select obsdate, productID from targetdates, products
and then update each price observation in this new table like this:
update fullprices f set price = (select price from price_changes where
productID = f.productID and date < f.obsdate
order by date desc
limit 1)
which should give me the most recent price change in each point in time.
Unfortunately, this takes ... well, ages. Is there any better way to do it?
== Edit: My tables are created as follows: ==
CREATE TABLE products
(
productID uuid NOT NULL,
name text NOT NULL,
CONSTRAINT products_pkey PRIMARY KEY (productID )
);
CREATE TABLE price_changes
(
id integer NOT NULL,
productID uuid NOT NULL,
price smallint,
date timestamp NOT NULL
);
CREATE INDEX idx_pc_date
ON price_changes USING btree
(date);
CREATE INDEX idx_pc_productID
ON price_changes USING btree
(productID);
CREATE TABLE targetdates
(
obsdate timestamp
);
CREATE TABLE fullprices
(
obsdate timestamp NOT NULL,
productID uuid NOT NULL,
price smallint
);

I'm trying to insert tuples into a table A (from table B) if the primary key of the table B tuple doesn't exist in table A

Here is what I have so far:
INSERT INTO Tenants (LeaseStartDate, LeaseExpirationDate, Rent, LeaseTenantSSN, RentOverdue)
SELECT CURRENT_DATE, NULL, NewRentPayments.Rent, NewRentPayments.LeaseTenantSSN, FALSE from NewRentPayments
WHERE NOT EXISTS (SELECT * FROM Tenants, NewRentPayments WHERE NewRentPayments.HouseID = Tenants.HouseID AND
NewRentPayments.ApartmentNumber = Tenants.ApartmentNumber)
So, HouseID and ApartmentNumber together make up the primary key. If there is a tuple in table B (NewRentPayments) that doesn't exist in table A (Tenants) based on the primary key, then it needs to be inserted into Tenants.
The problem is, when I run my query, it doesn't insert anything (I know for a fact there should be 1 tuple inserted). I'm at a loss, because it looks like it should work.
Thanks.
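Your subquery is not correlated with the outer query: it is a standalone join of Tenants and NewRentPayments, so it returns rows as soon as any payment matches any tenant, and NOT EXISTS is then false for every row of the outer query.
Going by the description of your problem, you don't need this join.
Try this: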
insert into Tenants (LeaseStartDate, LeaseExpirationDate, Rent, LeaseTenantSSN, RentOverdue)
select current_date, null, p.Rent, p.LeaseTenantSSN, FALSE
from NewRentPayments p
where not exists (
select *
from Tenants t
where p.HouseID = t.HouseID
and p.ApartmentNumber = t.ApartmentNumber
)
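
To make the difference concrete, here is a small sketch that runs both forms against the same toy data. It uses Python's built-in sqlite3 module instead of the asker's database purely so it is self-contained (an assumption; the NOT EXISTS semantics shown are standard SQL). The column types and sample rows are invented, and HouseID and ApartmentNumber are added to the insert column list so the primary key is populated, which the original statements do not do.

```python
import sqlite3

SETUP = """
CREATE TABLE Tenants (
    HouseID         INTEGER NOT NULL,
    ApartmentNumber INTEGER NOT NULL,
    Rent            REAL,
    LeaseTenantSSN  INTEGER,
    PRIMARY KEY (HouseID, ApartmentNumber)
);
CREATE TABLE NewRentPayments (
    HouseID         INTEGER NOT NULL,
    ApartmentNumber INTEGER NOT NULL,
    Rent            REAL,
    LeaseTenantSSN  INTEGER
);
-- Invented sample data: one existing tenant, one payment for that tenant,
-- and one payment for an apartment that has no tenant yet.
INSERT INTO Tenants VALUES (1, 101, 900, 111);
INSERT INTO NewRentPayments VALUES (1, 101, 900, 111), (2, 202, 750, 222);
"""

# The inner NewRentPayments shadows the outer one, so the subquery never looks
# at the row currently being considered for insertion.
UNCORRELATED = """
INSERT INTO Tenants (HouseID, ApartmentNumber, Rent, LeaseTenantSSN)
SELECT NewRentPayments.HouseID, NewRentPayments.ApartmentNumber,
       NewRentPayments.Rent, NewRentPayments.LeaseTenantSSN
FROM NewRentPayments
WHERE NOT EXISTS (
    SELECT * FROM Tenants, NewRentPayments
    WHERE NewRentPayments.HouseID = Tenants.HouseID
      AND NewRentPayments.ApartmentNumber = Tenants.ApartmentNumber
)
"""

# The subquery refers to the outer alias p, so it is evaluated per payment row.
CORRELATED = """
INSERT INTO Tenants (HouseID, ApartmentNumber, Rent, LeaseTenantSSN)
SELECT p.HouseID, p.ApartmentNumber, p.Rent, p.LeaseTenantSSN
FROM NewRentPayments p
WHERE NOT EXISTS (
    SELECT * FROM Tenants t
    WHERE p.HouseID = t.HouseID
      AND p.ApartmentNumber = t.ApartmentNumber
)
"""


def rows_inserted(insert_sql):
    """Run insert_sql against a fresh copy of the toy data and count new rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    before = conn.execute("SELECT COUNT(*) FROM Tenants").fetchone()[0]
    conn.execute(insert_sql)
    after = conn.execute("SELECT COUNT(*) FROM Tenants").fetchone()[0]
    conn.close()
    return after - before


uncorrelated_count = rows_inserted(UNCORRELATED)
correlated_count = rows_inserted(CORRELATED)
print(uncorrelated_count, correlated_count)
```

With this data the uncorrelated form inserts nothing, because the one matching payment makes the subquery non-empty for every outer row, while the correlated form inserts only the payment for apartment (2, 202).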

Postgres sort expression

I have a table with goods:
CREATE TABLE public.goods (
"id" bigserial NOT NULL,
title varchar(250) NOT NULL,
cost numeric(10,2),
PRIMARY KEY ("id")
);
Now I want to sort this table by title but put all goods with cost 0 at the end of the list. Is this possible?
If I try to use:
ORDER BY
cost DESC,
title ASC
I get an incorrect order by title, because the rows are sorted by cost first.
One way to do this is to use a CASE expression when ordering which places the block of records having a zero cost at the bottom. Then, within each block (either zero cost or non-zero cost), the records can be sorted alphabetically by the title.
SELECT cost, title
FROM public.goods
ORDER BY CASE WHEN cost = 0 THEN 1 ELSE 0 END,
title
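
As a quick check of the CASE ordering, the sketch below runs the same ORDER BY on a few invented rows. It uses Python's built-in sqlite3 module rather than PostgreSQL so that it runs standalone (an assumption; the table is a simplified copy of goods and the sample titles and costs are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE goods (id INTEGER PRIMARY KEY, title TEXT NOT NULL, cost NUMERIC)"
)
conn.executemany(
    "INSERT INTO goods (title, cost) VALUES (?, ?)",
    [("pear", 0), ("apple", 3.50), ("fig", 0), ("kiwi", 1.25)],
)

# The CASE yields 0 for non-zero costs and 1 for zero costs, so zero-cost rows
# sort last; title then orders the rows inside each of the two groups.
rows = conn.execute(
    """
    SELECT title, cost
    FROM goods
    ORDER BY CASE WHEN cost = 0 THEN 1 ELSE 0 END,
             title
    """
).fetchall()

titles = [title for title, _cost in rows]
print(titles)
```

Note that a NULL cost makes cost = 0 evaluate to NULL, so such rows fall into the ELSE branch and are sorted with the non-zero goods.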

An empty row with null-like values in NOT NULL fields

I'm using PostgreSQL 9.0 beta 4.
After inserting a lot of data into a partitioned table, I found something weird. When I query the table, I can see an empty row with null-like values in NOT NULL fields.
In the query result, the 689th row is empty. The first 3 fields (stid, d, ticker) make up the primary key, so they should not be null. The query I used is this:
select * from st_daily2 where stid=267408 order by d
I can even do the group by on this data.
select stid, date_trunc('month', d) ym, count(*) from st_daily2
where stid=267408 group by stid, date_trunc('month', d)
The GROUP BY result still has the empty row: its 1st row is empty.
But if I query where stid or d is null, it returns nothing.
Is this a bug in PostgreSQL 9.0 beta 4, or some data corruption?
EDIT :
I added my table definition.
CREATE TABLE st_daily
(
stid integer NOT NULL,
d date NOT NULL,
ticker character varying(15) NOT NULL,
mp integer NOT NULL,
settlep double precision NOT NULL,
prft integer NOT NULL,
atr20 double precision NOT NULL,
upd timestamp with time zone,
ntrds double precision
)
WITH (
OIDS=FALSE
);
CREATE TABLE st_daily2
(
CONSTRAINT st_daily2_pk PRIMARY KEY (stid, d, ticker),
CONSTRAINT st_daily2_strgs_fk FOREIGN KEY (stid)
REFERENCES strgs (stid) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT st_daily2_ck CHECK (stid >= 200000 AND stid < 300000)
)
INHERITS (st_daily)
WITH (
OIDS=FALSE
);
The data in this table is simulation results. Multithreaded multiple simulation engines written in c# insert data into the database using Npgsql.
psql also shows the empty row.
You'd better report this at http://www.postgresql.org/support/submitbug
Some questions:
Could you show us the table definitions and constraints for the partitions?
How did you load your data?
Do you get the same result when using another tool, like psql?
The answer to your problem may very well lie in your first sentence:
I'm using postgresql 9.0 beta 4.
Why would you do that? Upgrade to a stable release. Preferably the latest point-release of the current version.
This is 9.1.4 as of today.
I got to the same point: "what in the heck is that blank value?"
No, it's not a NULL, it's a -infinity.
To find such rows (along with genuine NULLs), use:
WHERE
case when mytestcolumn = '-infinity'::timestamp or
mytestcolumn = 'infinity'::timestamp
then NULL else mytestcolumn end IS NULL
instead of:
WHERE mytestcolumn IS NULL