How do I update 1.3 billion rows in this table more efficiently? - postgresql

I have 1.3 billion rows in a PostgreSQL table sku_comparison that looks like this:
id1 (INTEGER) | id2 (INTEGER) | (10 SMALLINT columns) | length1 (SMALLINT)... |
... length2 (SMALLINT) | length_difference (SMALLINT)
The id1 and id2 columns are referenced in a table called sku, which contains about 300,000 rows, and have an associated varchar(25) value in each row from a column, code.
There is a btree index built on id1 and id2, and a compound index of id1 and id2 in sku_comparison. There is a btree index on the id column of sku, as well.
My goal is to update the length1 and length2 columns with the lengths of the corresponding code column from the sku table. However, I ran the following code for over 20 hours, and it did not complete the update:
UPDATE sku_comparison SET length1=length(sku.code) FROM sku
WHERE sku_comparison.id1=sku.id;
All of the data is stored on a single hard disk on a local computer, and the processor is fairly modern. Constructing this table, which required much more complicated string comparisons in Python, only took about 30 hours or so, so I am not sure why something like this would take as long.
edit: here are formatted table definitions:
Table "public.sku"
Column | Type | Modifiers
------------+-----------------------+--------------------------------------------------
id | integer | not null default nextval('sku_id_seq'::regclass)
sku | character varying(25) |
pattern | character varying(25) |
pattern_an | character varying(25) |
firsttwo | character(2) | default ' '::bpchar
reference | character varying(25) |
Indexes:
"sku_pkey" PRIMARY KEY, btree (id)
"sku_sku_idx" UNIQUE, btree (sku)
"sku_firstwo_idx" btree (firsttwo)
Referenced by:
TABLE "sku_comparison" CONSTRAINT "sku_comparison_id1_fkey" FOREIGN KEY (id1) REFERENCES sku(id)
TABLE "sku_comparison" CONSTRAINT "sku_comparison_id2_fkey" FOREIGN KEY (id2) REFERENCES sku(id)
Table "public.sku_comparison"
Column | Type | Modifiers
---------------------------+----------+-------------------------
id1 | integer | not null
id2 | integer | not null
consec_charmatch | smallint |
consec_groupmatch | smallint |
consec_fieldtypematch | smallint |
consec_groupmatch_an | smallint |
consec_fieldtypematch_an | smallint |
general_charmatch | smallint |
general_groupmatch | smallint |
general_fieldtypematch | smallint |
general_groupmatch_an | smallint |
general_fieldtypematch_an | smallint |
length1 | smallint | default 0
length2 | smallint | default 0
length_difference | smallint | default '-999'::integer
Indexes:
"sku_comparison_pkey" PRIMARY KEY, btree (id1, id2)
"ssd_id1_idx" btree (id1)
"ssd_id2_idx" btree (id2)
Foreign-key constraints:
"sku_comparison_id1_fkey" FOREIGN KEY (id1) REFERENCES sku(id)
"sku_comparison_id2_fkey" FOREIGN KEY (id2) REFERENCES sku(id)

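Would you consider using an anonymous code block?
A PL/pgSQL sketch of the idea:
DO $$
DECLARE
    v_sku RECORD;
BEGIN
    -- Compute the length of each code once per sku row,
    -- then update only the comparison rows that reference it.
    FOR v_sku IN SELECT id, length(code) AS code_length FROM sku LOOP
        UPDATE sku_comparison
        SET length1 = v_sku.code_length
        WHERE id1 = v_sku.id;
    END LOOP;
END
$$;
This breaks the work into one small UPDATE per sku row, and the length of sku.code is evaluated once per sku instead of once per comparison row. Note that a plain DO block still runs as a single transaction; to get genuinely smaller transactions, issue COMMIT inside the loop (allowed in DO blocks from PostgreSQL 11, when not already inside a transaction block) or drive the loop from client code.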

Related

"duplicate key value violates unique constraint" during updating non unique field

We are trying to move our application to a new PostgreSQL cluster.
While doing that, we noticed that the application threw an exception like this:
[2017-06-02 14:43:34,530] ........ (psycopg2.IntegrityError) duplicate key value violates unique constraint "items_url"
DETAIL: Key (url)=(http://www.domainname.ru/ap_module/content/article/400-professional/140-professional/11880) already exists.
[SQL: 'UPDATE items SET status=%(status)s WHERE items.id IN ....
This is very strange because:
the application writes to fields of items, not to items_url (items_url is actually an index on items);
the UPDATE only changes the status field, which is not flagged unique and is not the primary key.
table items:
id | integer | not null default nextval(('public.items_id_seq'::text)::regclass)
ctime | timestamp without time zone | not null default now()
pubdate | timestamp without time zone | not null default now()
resource_id | integer | not null default 0
url | text |
title | text |
description | text |
body | text |
status | smallint | not null default 0
image | text |
orig_id | integer | not null default 0
mtime | timestamp without time zone | not null default now()
checksum | text |
video_url | text |
audio_url | text |
content_type | smallint | default 0
author | text |
video | text |
fulltext_status | smallint | default 0
summary | text |
image_id | integer |
video_id | integer |
priority | smallint |
Indexes:
"items_pkey" PRIMARY KEY, btree (id)
"items_url" UNIQUE, btree (url)
"items_resource_id" btree (resource_id)
"ndx__items__ctime" btree (ctime)
"ndx__items__image" btree (image_id)
"ndx__items__mtime" btree (mtime)
"ndx__items__pubdate" btree (pubdate)
"ndx__items__video" btree (video_id)
Foreign-key constraints:
"items_fkey1" FOREIGN KEY (image_id) REFERENCES images(id) ON UPDATE CASCADE ON DELETE SET NULL
"items_fkey2" FOREIGN KEY (video_id) REFERENCES videos(id) ON UPDATE CASCADE ON DELETE SET NUL
Well, the question is why it happens and how can I troubleshoot this?
Thank you.
UPD1:
I tried to reproduce it on 9.4, and it reproduced there as well.
I also experimented with client_encoding. The encoding is the same everywhere.

Is trigger or alias returning data in Postgres?

Postgres 9.2.2, Django 1.6.5, Python 2.7.5
Column | Type | Modifiers
---------------+------------------------+------------------------------------
id | integer | not null default nextval('employee_id_seq'::regclass)
first_name | character varying(75) | not null
surname | character varying(75) | not null
mname | character varying(75) |
date_of_birth | date |
staff_id | character varying(20) |
img1 | character varying(100) |
slug | character varying(50) |
created | date | not null
modified | date | not null
ppsn | character varying(20) |
Indexes:
"employee_pkey" PRIMARY KEY, btree (id)
"employee_slug_key" UNIQUE, btree (slug)
"employee_slug_like" btree (slug varchar_pattern_ops)
Referenced by:
TABLE "employeedynamic" CONSTRAINT "employee_id_refs_id_71e22023" FOREIGN KEY (employee_id) REFERENCES employee(id) DEFERRABLE INITIALLY DEFERRED
TABLE "drvliclicence" CONSTRAINT "employee_id_refs_id_afc65012" FOREIGN KEY (employee_id) REFERENCES employee(id) DEFERRABLE INITIALLY DEFERRED
TABLE "coursedetail_attendance" CONSTRAINT "employee_id_refs_id_c8466b5f" FOREIGN KEY (employee_id) REFERENCES employee(id) DEFERRABLE INITIALLY DEFERRED
.
=# select a.name from employee a where a.id = 366;
(366,Tommy,Gibbons,"",1800-08-21,1002180,images/GibbonsT2010_1.
(1 row)
Problem: How is a.name returning these details, given that the table has no name column?
I've tried looking up aliases and triggers, but I cannot figure this out.
Check trigger:
=# \dft
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+------+------------------+---------------------+------
(0 rows)
or
=# SELECT tgname FROM pg_trigger, pg_class WHERE tgrelid=pg_class.oid and relname = 'employee';
RI_ConstraintTrigger_101722
RI_ConstraintTrigger_101723
RI_ConstraintTrigger_101732
RI_ConstraintTrigger_101733
RI_ConstraintTrigger_101737
RI_ConstraintTrigger_101738
(6 rows)
Any idea how I can find what is causing a.name to return data?

Why evolution in Play Framework doesn't work?

I'm using Play 2.3 and trying to generate a relational database with evolutions for PostgreSQL 9.4.
I have the following statements in my conf/evolutions/default/1.sql script:
ALTER TABLE ONLY round
ADD CONSTRAINT round_event_id_fkey FOREIGN KEY (event_id) REFERENCES event(id);
ALTER TABLE ONLY round
ADD CONSTRAINT round_event_id UNIQUE (event_id);
Following is my event table description:
Table "public.event"
Column | Type | Modifiers
-------------------------------+-----------------------------+----------------------------------------------------
id | integer | not null default nextval('event_id_seq'::regclass)
related_event_hash | character varying(45) |
start_time | timestamp without time zone |
end_time | timestamp without time zone |
name | character varying(45) |
status | character varying(45) | not null
owner_id | bigint | not null
venue_id | bigint |
participation_hash | character varying(45) |
number_of_participants | integer |
number_of_backup_participants | integer |
created | timestamp without time zone | not null
updated | timestamp without time zone | not null
Indexes:
"event_pkey" PRIMARY KEY, btree (id)
"index_event_name" btree (name)
"index_event_status" btree (status)
"index_start_time" btree (start_time)
Foreign-key constraints:
"event_owner_id_fkey" FOREIGN KEY (owner_id) REFERENCES person(id)
"event_venue_id_fkey" FOREIGN KEY (venue_id) REFERENCES venue(id)
Referenced by:
TABLE "anonymous_person" CONSTRAINT "anonymous_person_event_id_fkey" FOREIGN KEY (event_id) REFERENCES event(id)
TABLE "mix_game" CONSTRAINT "mix_game_event_id_fkey" FOREIGN KEY (event_id) REFERENCES event(id)
TABLE "participant" CONSTRAINT "participant_event_id_fkey" FOREIGN KEY (event_id) REFERENCES event(id)
When I start the application in a browser I get this error:
Database 'default' is in an inconsistent state!
While trying to run this SQL script, we got the following error:
ERROR: there is no unique constraint matching given keys for referenced table "round" [ERROR:0, SQLSTATE:42830]
What could be wrong? How can I fix this error and add the foreign key constraints?
Note that it generates the round table as follows, without the foreign key constraints.
Table "public.round"
Column | Type | Modifiers
------------------+-----------------------+----------------------------------------------------
id | integer | not null default nextval('round_id_seq'::regclass)
round_no | integer | not null
event_id | bigint | not null
state | character varying(20) | not null
team_composition | character(12) | not null
result | character varying(20) |
description | character varying(45) |
play_time | integer | not null
shift_time | integer |
change_time | integer |
Indexes:
"round_pkey" PRIMARY KEY, btree (id)
"round_event_id" UNIQUE CONSTRAINT, btree (event_id)
Take a look at the documentation.
As it shows, you have to delimit both the Ups and the Downs sections with comments in your SQL script.
Also, do not edit the 1.sql file, because it is updated by the evolution mechanism. Start your own evolutions at 2.sql.
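For illustration, a hand-written evolution along those lines might look like the sketch below. The `# --- !Ups` and `# --- !Downs` comments are the section delimiters that Play evolutions look for. Placing this in conf/evolutions/default/2.sql and picking this particular constraint are assumptions based on the question, and the sketch is not claimed to resolve the reported error by itself.

```sql
# --- !Ups

-- Applied when the evolution is run forward.
ALTER TABLE ONLY round
    ADD CONSTRAINT round_event_id_fkey FOREIGN KEY (event_id) REFERENCES event(id);

# --- !Downs

-- Applied when the evolution is reverted; it should undo the Ups section.
ALTER TABLE ONLY round
    DROP CONSTRAINT round_event_id_fkey;
```

Keeping the Downs section as the exact inverse of the Ups section lets Play revert the script cleanly when the file changes.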

Optimizing PostgreSQL query with slow ORDER BY

I'm trying to get this query to run faster. It seems like sorting by the quality field is what really slows it down (the table has about 5 million rows) - maybe there is an index I can use for that?
Query:
SELECT "connectr_twitterpassage"."id", "connectr_twitterpassage"."third_party_id", "connectr_twitterpassage"."third_party_created", "connectr_twitterpassage"."source", "connectr_twitterpassage"."text", "connectr_twitterpassage"."author", "connectr_twitterpassage"."raw_data", "connectr_twitterpassage"."retweet_count", "connectr_twitterpassage"."favorited_count", "connectr_twitterpassage"."lang", "connectr_twitterpassage"."location", "connectr_twitterpassage"."author_followers_count", "connectr_twitterpassage"."is_retweet", "connectr_twitterpassage"."url", "connectr_twitterpassage"."author_fk_id", "connectr_twitterpassage"."quality", "connectr_twitterpassage"."is_top_tweet", "connectr_twitterpassage"."created", "connectr_twitterpassage"."modified"
FROM "connectr_twitterpassage"
INNER JOIN "connectr_twitterpassage_words" ON ("connectr_twitterpassage"."id" = "connectr_twitterpassage_words"."twitterpassage_id")
WHERE "connectr_twitterpassage_words"."word_id" = 18974807
ORDER BY "connectr_twitterpassage"."quality" DESC
LIMIT 100
Here is the EXPLAIN ANALYZE: http://explain.depesz.com/s/7zb
And the table definitions:
\d connectr_twitterpassage
Column | Type | Modifiers
------------------------+--------------------------+----------------------------------------------------------------------
id | integer | not null default nextval('connectr_twitterpassage_id_seq'::regclass)
third_party_id | character varying(10000) | not null
source | character varying(10000) | not null
text | character varying(10000) | not null
author | character varying(10000) | not null
raw_data | character varying(10000) | not null
created | timestamp with time zone | not null
modified | timestamp with time zone | not null
third_party_created | timestamp with time zone |
retweet_count | integer | not null
favorited_count | integer | not null
lang | character varying(10000) | not null
location | character varying(10000) | not null
author_followers_count | integer | not null
is_retweet | boolean | not null
url | character varying(10000) | not null
author_fk_id | integer |
quality | bigint |
is_top_tweet | boolean | not null
Indexes:
"connectr_passage_pkey" PRIMARY KEY, btree (id)
"connectr_twitterpassage_third_party_id_uniq" UNIQUE CONSTRAINT, btree (third_party_id)
"connectr_passage_author_followers_count" btree (author_followers_count)
"connectr_passage_favorited_count" btree (favorited_count)
"connectr_passage_retweet_count" btree (retweet_count)
"connectr_passage_source" btree (source)
"connectr_passage_source_like" btree (source varchar_pattern_ops)
"connectr_passage_third_party_id" btree (third_party_id)
"connectr_passage_third_party_id_like" btree (third_party_id varchar_pattern_ops)
"connectr_twitterpassage_author_fk_id" btree (author_fk_id)
"connectr_twitterpassage_created" btree (created)
"connectr_twitterpassage_is_top_tweet" btree (is_top_tweet)
"connectr_twitterpassage_quality" btree (quality)
"connectr_twitterpassage_third_party_created" btree (third_party_created)
"id_to_quality_sorted" btree (id, quality DESC NULLS LAST)
Foreign-key constraints:
"author_fk_id_refs_id_074720a5" FOREIGN KEY (author_fk_id) REFERENCES connectr_twitteruser(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
TABLE "connectr_passageviewevent" CONSTRAINT "passage_id_refs_id_892b36a6" FOREIGN KEY (passage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_connection" CONSTRAINT "twitter_from_id_refs_id_8adbab24" FOREIGN KEY (twitter_from_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_connection" CONSTRAINT "twitter_to_id_refs_id_8adbab24" FOREIGN KEY (twitter_to_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_twitterpassage_words" CONSTRAINT "twitterpassage_id_refs_id_720f772f" FOREIGN KEY (twitterpassage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
connectr=# \d connectr_twitterpassage_words
Table "public.connectr_twitterpassage_words"
Column | Type | Modifiers
-------------------+---------+----------------------------------------------------------------------------
id | integer | not null default nextval('connectr_twitterpassage_words_id_seq'::regclass)
twitterpassage_id | integer | not null
word_id | integer | not null
Indexes:
"connectr_twitterpassage_words_pkey" PRIMARY KEY, btree (id)
"connectr_twitterpassage_twitterpassage_id_613c80271f09fba8_uniq" UNIQUE CONSTRAINT, btree (twitterpassage_id, word_id)
"connectr_twitterpassage_words_twitterpassage_id" btree (twitterpassage_id)
"connectr_twitterpassage_words_word_id" btree (word_id)
"word_to_twitterpassage_id" btree (word_id, twitterpassage_id)
Foreign-key constraints:
"twitterpassage_id_refs_id_720f772f" FOREIGN KEY (twitterpassage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
"word_id_refs_id_64f49629" FOREIGN KEY (word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
connectr=# \d connectr_word
Table "public.connectr_word"
Column | Type | Modifiers
---------------------+--------------------------+------------------------------------------------------------
id | integer | not null default nextval('connectr_word_id_seq'::regclass)
word | character varying(10000) | not null
created | timestamp with time zone | not null
modified | timestamp with time zone | not null
frequency | double precision |
is_username | boolean | not null
is_hashtag | boolean | not null
cloud_eligible | boolean | not null
passage_count | integer |
avg_quality | double precision |
last_twitter_search | timestamp with time zone |
cloud_approved | boolean | not null
display_word | character varying(10000) | not null
is_trend | boolean | not null
Indexes:
"connectr_word_pkey" PRIMARY KEY, btree (id)
"connectr_word_word_uniq" UNIQUE CONSTRAINT, btree (word)
"connectr_word_avg_quality" btree (avg_quality)
"connectr_word_cloud_eligible" btree (cloud_eligible)
"connectr_word_last_twitter_search" btree (last_twitter_search)
"connectr_word_passage_count" btree (passage_count)
"connectr_word_word" btree (word)
Referenced by:
TABLE "connectr_passageviewevent" CONSTRAINT "source_word_id_refs_id_178d46eb" FOREIGN KEY (source_word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_wordmatchrewardevent" CONSTRAINT "tapped_word_id_refs_id_c2ffb369" FOREIGN KEY (tapped_word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_connection" CONSTRAINT "word_id_refs_id_00cccde2" FOREIGN KEY (word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_twitterpassage_words" CONSTRAINT "word_id_refs_id_64f49629" FOREIGN KEY (word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
Looking at the explain output, the sort is taking very little of the time. It is gathering the data it needs to sort that takes the time.
You must be spending a bit of time hitting the disk. If you could get your data better cached, that should speed it up a lot using the same query.
Otherwise, your best bet may be to denormalize the data by adding the quality field to the connectr_twitterpassage_words table and having an index on (word_id, quality,...)
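A rough sketch of that denormalization, under some assumptions: the index name is hypothetical, the index stops at (word_id, quality) because the remaining columns are not specified above, and nothing here keeps the copied column in sync when quality changes (a trigger or application code would have to do that).

```sql
-- Copy quality onto the join table so one index can serve both filter and sort.
ALTER TABLE connectr_twitterpassage_words ADD COLUMN quality bigint;

UPDATE connectr_twitterpassage_words w
SET quality = p.quality
FROM connectr_twitterpassage p
WHERE p.id = w.twitterpassage_id;

-- Hypothetical index name; DESC matches the ORDER BY ... DESC in the query.
CREATE INDEX words_word_id_quality_idx
    ON connectr_twitterpassage_words (word_id, quality DESC);

-- The query can then filter and sort on the join table alone,
-- and fetch only the top 100 passages from the large table.
SELECT p.*
FROM connectr_twitterpassage_words w
JOIN connectr_twitterpassage p ON p.id = w.twitterpassage_id
WHERE w.word_id = 18974807
ORDER BY w.quality DESC
LIMIT 100;
```

Whether this helps in practice depends on the plan the optimizer picks, so it is worth comparing EXPLAIN ANALYZE output before and after.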

reference to a sequence column (postgresql)

I encountered a problem when creating a foreign key that references a sequence; see the code example below.
On creating the tables I receive the following error.
"Detail: Key columns "product" and "id" are of incompatible types: integer and ownseq"
I've already tried different datatypes for the product column (like smallint, bigint) but none of them is accepted.
CREATE SEQUENCE ownseq INCREMENT BY 1 MINVALUE 100 MAXVALUE 99999;
CREATE TABLE products (
id ownseq PRIMARY KEY,
...);
CREATE TABLE basket (
basket_id SERIAL PRIMARY KEY,
product INTEGER FOREIGN KEY REFERENCES products(id));
CREATE SEQUENCE ownseq INCREMENT BY 1 MINVALUE 100 MAXVALUE 99999;
CREATE TABLE products (
id integer PRIMARY KEY default nextval('ownseq'),
...
);
alter sequence ownseq owned by products.id;
The key change is that id is defined as an integer, rather than as ownseq. This is what would happen if you used the SERIAL pseudo-type to create the sequence.
Try
CREATE TABLE products (
id INTEGER DEFAULT nextval(('ownseq'::text)::regclass) NOT NULL PRIMARY KEY,
...);
or don't create the sequence ownseq and let postgres do it for you:
CREATE TABLE products (
id SERIAL NOT NULL PRIMARY KEY
...);
In the above case, the name of the sequence that Postgres creates should be products_id_seq.
Hope this helps.
PostgreSQL is powerful and you have just been bitten by an advanced feature.
Your DDL is quite valid but not at all what you think it is.
A sequence can be thought of as an extra-transactional simple table used for generating next values for some columns.
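To make "extra-transactional" concrete, here is a small sketch; demo_seq is a hypothetical name used only for this illustration.

```sql
-- demo_seq is a hypothetical sequence created just for this demonstration.
CREATE SEQUENCE demo_seq;

BEGIN;
SELECT nextval('demo_seq');  -- consumes a value inside the transaction
ROLLBACK;

-- The rollback does not hand the consumed value back:
-- this call returns the following value, not the same one again.
SELECT nextval('demo_seq');
```

This is why sequence-generated ids can have gaps even when no rows were deleted.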
What you meant to do
You meant to have the id field defined thus, as per the other answer:
id integer PRIMARY KEY default nextval('ownseq'),
What you did
What you did was actually define a nested data structure for your table. Suppose I create a test sequence:
CREATE SEQUENCE testseq;
Then suppose I \d testseq on Pg 9.1, I get:
Sequence "public.testseq"
Column | Type | Value
---------------+---------+---------------------
sequence_name | name | testseq
last_value | bigint | 1
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 0
is_cycled | boolean | f
is_called | boolean | f
This is the definition of the row type that comes with the sequence.
Now suppose I:
create table seqtest (test testseq, id serial);
I can insert into it:
INSERT INTO seqtest (id, test) values (default, '("testseq",3,4,1,133445,1,1,0,f,f)');
I can then select from it:
select * from seqtest;
test | id
----------------------------------+----
(testseq,3,4,1,133445,1,1,0,f,f) | 2
Moreover I can expand test:
select (test).* from seqtest;
sequence_name | last_value | start_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called
---------------+------------+-------------+--------------+-----------+-----------+-------------+---------+-----------+-----------
 | | | | | | | | |
testseq | 3 | 4 | 1 | 133445 | 1 | 1 | 0 | f | f
(2 rows)
This sort of thing is actually very powerful in PostgreSQL but full of unexpected corners (for example not null and check constraints don't work as expected with nested data types). I don't generally recommend nested data types, but it is worth knowing that PostgreSQL can do this and will be happy to accept SQL commands to do it without warning.