Speed up heavy UPDATE..FROM..WHERE PostgreSQL query

I have 2 big tables
CREATE TABLE "public"."linkages" (
"supplierid" integer NOT NULL,
"articlenumber" character varying(32) NOT NULL,
"article_id" integer,
"vehicle_id" integer
);
CREATE INDEX "__linkages_datasupplierarticlenumber" ON "public"."__linkages" USING btree ("datasupplierarticlenumber");
CREATE INDEX "__linkages_supplierid" ON "public"."__linkages" USING btree ("supplierid");
with 215,000,000 records, and
CREATE TABLE "public"."article" (
"id" integer DEFAULT nextval('tecdoc_article_id_seq') NOT NULL,
"part_number" character varying(32),
"supplier_id" integer,
CONSTRAINT "tecdoc_article_part_number_supplier_id" UNIQUE ("part_number", "supplier_id")
) WITH (oids = false);
with 5,500,000 records.
I need to update linkages.article_id according to article.part_number and article.supplier_id, like this:
UPDATE linkages
SET article_id = article.id
FROM
article
WHERE
linkages.supplierid = article.supplier_id AND
linkages.articlenumber = article.part_number;
But it is too heavy: I ran it, but after a day of processing with no result I terminated it.
I need to do this update only once, to normalize my table structure for using foreign keys with the Django ORM. How can I resolve this issue?
Thanks a lot!
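A sketch of one possible approach (an assumption, not a verified answer for this data; the index and table names below are arbitrary): index the join columns on the smaller table and, instead of updating 215 million rows in place, build a new table with CREATE TABLE AS and swap it in, since a whole-table UPDATE rewrites every row anyway.
-- Sketch only: index the join columns on the small table first
CREATE INDEX article_supplier_part ON article (supplier_id, part_number);

-- Build a new table with the resolved article_id instead of updating in place
CREATE TABLE linkages_new AS
SELECT l.supplierid,
       l.articlenumber,
       a.id AS article_id,
       l.vehicle_id
FROM linkages l
LEFT JOIN article a
  ON a.supplier_id = l.supplierid
 AND a.part_number = l.articlenumber;

-- After verifying the result: drop the old table, rename linkages_new to linkages,
-- and recreate the indexes and constraints on it.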

Related

Update and insert performance with partial indexes

I have different queries that fetch data from a large table (about 100-200M rows). Since I know each query in advance, I've created partial indexes with different predicates to fit them.
For example, the table looks similar to this:
CREATE TABLE public.contacts (
id int8 NOT NULL DEFAULT ssng_generate_id(8::bigint),
created timestamp NOT NULL DEFAULT timezone('UTC'::text, now()),
contact_pool_id int8 NOT NULL,
project_id int8 NOT NULL,
state_id int4 NOT NULL DEFAULT 10,
order_x int4 NOT NULL,
next_attempt_date timestamp NULL,
CONSTRAINT contacts_pkey PRIMARY KEY (id)
);
And there are two types of query:
SELECT * FROM contacts WHERE contact_pool_id = X AND state_id = 10 ORDER BY order_x LIMIT 1;
and
SELECT * FROM contacts WHERE contact_pool_id = X AND state_id = 20 AND next_attempt_date <= now() ORDER BY next_attempt_date LIMIT 1;
For those queries I've created partial indexes:
For state_id = 10 (new contacts)
CREATE INDEX ix_contacts_cpid_orderx_id_for_new ON contacts USING btree (contact_pool_id, order_x, id) WHERE state_id = 10;
For state_id = 20 (available contacts)
CREATE INDEX ix_contacts_cpid_nextattepmdate_id_for_available ON contacts USING btree (contact_pool_id, next_attempt_date, id) WHERE state_id = 20;
For me, those partial indexes are faster than a single index.
But what about update and insert performance? If I change a row with state_id = 20, will it affect only index 2 (for available contacts), or will both of them be affected?
Partial indexes which are not relevant to the tuple will not get updated.
If PostgreSQL can do a HOT update (if the column being changed is not part of an index, and there is room on the same page for the new tuple), then even the relevant index doesn't need to get updated.
Yes: with a partial index you only pay the overhead of modifying the index for rows that meet its WHERE condition, so for any given row you will need to modify at most one of the two indexes (unless you change state_id from 10 to 20 or vice versa).
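A rough way to observe this, assuming the contacts table above and a hand-picked id (both hypothetical here), is to compare the HOT-update counter in pg_stat_user_tables before and after an UPDATE that only touches a column that appears in no index:
-- Counters before the update
SELECT n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables WHERE relname = 'contacts';

-- project_id is not part of any index, so this update is a HOT candidate
UPDATE contacts SET project_id = project_id WHERE id = 1;  -- hypothetical id

-- If n_tup_hot_upd increased together with n_tup_upd, no index had to be touched
SELECT n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables WHERE relname = 'contacts';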

T-SQL data type for 39-digit number?

I'm using the free IPv6 data table from Ip2Location. The table schema is defined as:
CREATE TABLE [ip2location].[dbo].[ip2location_db11_ipv6](
[ip_from] char(39) NOT NULL,
[ip_to] char(39) NOT NULL,
[country_code] nvarchar(2) NOT NULL,
[country_name] nvarchar(64) NOT NULL,
[region_name] nvarchar(128) NOT NULL,
[city_name] nvarchar(128) NOT NULL,
[latitude] float NOT NULL,
[longitude] float NOT NULL,
[zip_code] nvarchar(30) NOT NULL,
[time_zone] nvarchar(8) NOT NULL
) ON [PRIMARY]
GO
CREATE INDEX [ip_from] ON [ip2location].[dbo].[ip2location_db11_ipv6]([ip_from]) ON [PRIMARY]
GO
CREATE INDEX [ip_to] ON [ip2location].[dbo].[ip2location_db11_ipv6]([ip_to]) ON [PRIMARY]
GO
Queries against the table use an IPv6 address converted to an IpNumber. The provided search query is:
SELECT *
FROM ip2location_db11_ipv6
WHERE #IpNumber BETWEEN ip_from AND ip_to
Based on a few random lookups, my average query time was ~1.2 seconds. With 3,458,959 records in the table, this seems a bit slow to me (is it? I'm by no means a SQL guru). My first thought was to make the ip_from and ip_to columns a numeric data type, but the maximum value is 58569107375850417935858934690443427840 (39 digits), which falls outside the maximum range of the DECIMAL type. Is there anything that can be done to improve the query time?
You may try BINARY(17); a 39-digit decimal value fits into 17 bytes. See the binary type documentation.
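As a sketch of what that could look like (the table and index names below are made up, and @IpNumber is assumed to be produced by the application as a big-endian, zero-padded 17-byte value), fixed-length binary columns compare byte by byte, so the range predicate keeps working:
-- Sketch: boundaries stored as fixed-length binary instead of char(39)
CREATE TABLE [dbo].[ip2location_db11_ipv6_bin](
[ip_from] binary(17) NOT NULL,
[ip_to] binary(17) NOT NULL,
[country_code] nvarchar(2) NOT NULL
-- remaining columns as in the original table
)
GO
CREATE INDEX [ip_from_bin] ON [dbo].[ip2location_db11_ipv6_bin]([ip_from])
GO
-- Placeholder value; the real value must be exactly 17 bytes (34 hex digits)
DECLARE @IpNumber binary(17) = 0x0000000000000000000000000000000001;
SELECT *
FROM [dbo].[ip2location_db11_ipv6_bin]
WHERE @IpNumber BETWEEN [ip_from] AND [ip_to];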

Efficient way to reconstruct base table from changes

I have a table of products (with IDs, ~15k records) and another table price_changes (~88m records) recording a change in the price of a given productID at a given change date.
I'm now interested in the price of each product at given points in time (say every 2 hours for a year, so ~4,300 points, resulting in ~64m data points of interest). While it's very straightforward to determine the price of a given product at a given time, it seems quite time-consuming to determine all 64m data points.
My approach is to pre-populate a new target table fullprices with the data points of interest:
insert into fullprices(obsdate,productID)
select obsdate, productID from targetdates, products
and then update each price observation in this new table like this:
update fullprices f set price = (select price from price_changes where
productID = f.productID and date < f.obsdate
order by date desc
limit 1)
which should give me the most recent price change at each point in time.
Unfortunately, this takes ... well, ages. Is there a better way to do it?
== Edit: My tables are created as follows: ==
CREATE TABLE products
(
productID uuid NOT NULL,
name text NOT NULL,
CONSTRAINT products_pkey PRIMARY KEY (productID )
);
CREATE TABLE price_changes
(
id integer NOT NULL,
productID uuid NOT NULL,
price smallint,
date timestamp NOT NULL
);
CREATE INDEX idx_pc_date
ON price_changes USING btree
(date);
CREATE INDEX idx_pc_productID
ON price_changes USING btree
(productID);
CREATE TABLE targetdates
(
obsdate timestamp
);
CREATE TABLE fullprices
(
obsdate timestamp NOT NULL,
productID uuid NOT NULL,
price smallint
);
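Given the definitions above, one commonly suggested rewrite (a sketch, assuming PostgreSQL 9.3 or later for LATERAL; the index name and query shape are illustrative, not the asker's code) is to index price_changes on (productID, date) so each "latest change before obsdate" lookup becomes a single index descent, and to fill fullprices in one pass instead of running a correlated UPDATE per row:
-- Composite index for the per-product "latest change before a given date" lookup
CREATE INDEX idx_pc_product_date ON price_changes (productID, date DESC);

-- Populate fullprices in a single statement
INSERT INTO fullprices (obsdate, productID, price)
SELECT t.obsdate, p.productID, pc.price
FROM targetdates t
CROSS JOIN products p
LEFT JOIN LATERAL (
    SELECT price
    FROM price_changes c
    WHERE c.productID = p.productID
      AND c.date < t.obsdate
    ORDER BY c.date DESC
    LIMIT 1
) pc ON true;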

My SQL query runs slow and needs optimising

I'm having issues with the query below in my iPhone app. When the app runs the query it takes quite a while to process the result, maybe around a second or so... I was wondering if the query can be optimised in any way? I'm using the FMDB framework to process all my SQL.
select pd.discounttypeid, pd.productdiscountid, pd.quantity, pd.value, p.name, p.price, pi.path
from productdeals as pd, product as p, productimages as pi
where pd.productid = 53252
and pd.discounttypeid == 8769
and pd.productdiscountid = p.parentproductid
and pd.productdiscountid = pi.productid
and pi.type = 362
order by pd.id
limit 1
My statements are below for the tables:
CREATE TABLE "ProductImages" (
"ProductID" INTEGER,
"Type" INTEGER,
"Path" TEXT
)
CREATE TABLE "Product" (
"ProductID" INTEGER PRIMARY KEY,
"ParentProductID" INTEGER,
"levelType" INTEGER,
"SKU" TEXT,
"Name" TEXT,
"BrandID" INTEGER,
"Option1" INTEGER,
"Option2" INTEGER,
"Option3" INTEGER,
"Option4" INTEGER,
"Option5" INTEGER,
"Price" NUMERIC,
"RRP" NUMERIC,
"averageRating" INTEGER,
"publishedDate" DateTime,
"salesLastWeek" INTEGER
)
CREATE TABLE "ProductDeals" (
"ID" INTEGER,
"ProductID" INTEGER,
"DiscountTypeID" INTEGER,
"ProductDiscountID" INTEGER,
"Quantity" INTEGER,
"Value" INTEGER
)
Do you have indexes on the foreign key columns (productimages.productid and product.parentproductid) and on the columns you use to find the right product deal (productdeals.productid and productdeals.discounttypeid)? If not, that could be the cause of the poor performance.
You can create them like this:
CREATE INDEX idx_images_productid ON productimages(productid);
CREATE INDEX idx_products_parentid ON product(parentproductid);
CREATE INDEX idx_deals ON productdeals(productid, discounttypeid);
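If it's unclear whether those indexes actually get picked up, SQLite's EXPLAIN QUERY PLAN can be run on the original query (a quick check, assuming the tables and indexes above):
EXPLAIN QUERY PLAN
SELECT pd.discounttypeid, pd.productdiscountid, pd.quantity, pd.value, p.name, p.price, pi.path
FROM productdeals AS pd, product AS p, productimages AS pi
WHERE pd.productid = 53252
  AND pd.discounttypeid = 8769
  AND pd.productdiscountid = p.parentproductid
  AND pd.productdiscountid = pi.productid
  AND pi.type = 362
ORDER BY pd.id
LIMIT 1;
-- Output lines reading "SEARCH ... USING INDEX idx_deals" (rather than "SCAN TABLE ...")
-- indicate that the indexes are being used.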
The query below could help reduce the execution time; additionally, make sure the indexes are created on the right fields to speed up the query.
select pd.discounttypeid, pd.productdiscountid, pd.quantity, pd.value, p.name, p.price, pi.path
from productdeals pd
join product p on pd.productdiscountid = p.parentproductid
join productimages pi on pd.productdiscountid = pi.productid
where pd.productid = 53252
  and pd.discounttypeid = 8769
  and pi.type = 362
order by pd.id
limit 1
Thanks

an empty row with null-like values in not-null field

I'm using postgresql 9.0 beta 4.
After inserting a lot of data into a partitioned table, I found something weird. When I query the table, I can see an empty row with null-like values in NOT NULL fields.
In the query result, the 689th row is empty. The first three fields (stid, d, ticker) make up the primary key, so they should not be null. The query I used is this:
select * from st_daily2 where stid=267408 order by d
I can even do a GROUP BY on this data.
select stid, date_trunc('month', d) ym, count(*) from st_daily2
where stid=267408 group by stid, date_trunc('month', d)
The GROUP BY result still has the empty row; this time the first row is empty.
But if I query for rows where stid or d is null, it returns nothing.
Is this a bug in PostgreSQL 9.0 beta 4, or some data corruption?
EDIT :
I added my table definition.
CREATE TABLE st_daily
(
stid integer NOT NULL,
d date NOT NULL,
ticker character varying(15) NOT NULL,
mp integer NOT NULL,
settlep double precision NOT NULL,
prft integer NOT NULL,
atr20 double precision NOT NULL,
upd timestamp with time zone,
ntrds double precision
)
WITH (
OIDS=FALSE
);
CREATE TABLE st_daily2
(
CONSTRAINT st_daily2_pk PRIMARY KEY (stid, d, ticker),
CONSTRAINT st_daily2_strgs_fk FOREIGN KEY (stid)
REFERENCES strgs (stid) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT st_daily2_ck CHECK (stid >= 200000 AND stid < 300000)
)
INHERITS (st_daily)
WITH (
OIDS=FALSE
);
The data in this table is simulation results; multiple multithreaded simulation engines written in C# insert data into the database using Npgsql.
psql also shows the empty row.
You should file a bug report at http://www.postgresql.org/support/submitbug
Some questions:
- Could you show us the table definitions and constraints for the partitions?
- How did you load your data?
- Do you get the same result when using another tool, like psql?
The answer to your problem may very well lie in your first sentence:
I'm using postgresql 9.0 beta 4.
Why would you do that? Upgrade to a stable release. Preferably the latest point-release of the current version.
This is 9.1.4 as of today.
I got to the same point: "what in the heck is that blank value?"
No, it's not a NULL, it's a -infinity.
To filter for such a row use:
WHERE
  CASE WHEN mytestcolumn = '-infinity'::timestamp
         OR mytestcolumn = 'infinity'::timestamp
       THEN NULL
       ELSE mytestcolumn
  END IS NULL
instead of:
WHERE mytestcolumn IS NULL
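As a small illustration of that answer (not from the original thread), date and timestamp columns can legitimately hold the special values infinity and -infinity, which some client tools display in a way that looks like an empty cell, so the suspicious rows can be located directly:
-- The special values parse and compare like ordinary dates/timestamps
SELECT '-infinity'::date AS d_low, 'infinity'::timestamp AS t_high;

-- Locate the rows from the question that carry such values
SELECT *
FROM st_daily2
WHERE stid = 267408
  AND d IN ('-infinity'::date, 'infinity'::date);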