Optimizing Postgres Count For Aggregated Select - postgresql

I have a query that is intended to retrieve the count of each grouped product, like so:
SELECT
    product_name,
    product_color,
    (array_agg("product_distributor"))[1] AS "product_distributor",
    (array_agg("product_release"))[1] AS "product_release",
    COUNT(*) AS "count"
FROM
    product
WHERE
    product.id IN (
        SELECT id
        FROM product
        WHERE (product_name ILIKE '%red%'
               OR product_color ILIKE '%red%')
          AND product_type = 1)
GROUP BY
    product_name, product_color
LIMIT 1000
OFFSET 0;
This query is run on the following table
       Column        |           Type           | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
 product_type        | integer                  |           | not null |
 id                  | integer                  |           | not null |
 product_name        | citext                   |           | not null |
 product_color       | character varying(255)   |           |          |
 product_distributor | integer                  |           |          |
 product_release     | timestamp with time zone |           |          |
 created_at          | timestamp with time zone |           | not null |
 updated_at          | timestamp with time zone |           | not null |
Indexes:
"product_pkey" PRIMARY KEY, btree (id)
"product_distributer_index" btree (product_distributor)
"product_product_type_name_color" UNIQUE, btree (product_type, name, color)
"product_product_type_index" btree (product_type)
"product_name_color_index" btree (name, color)
Foreign-key constraints:
"product_product_type_fkey" FOREIGN KEY (product_type) REFERENCES product_type(id) ON UPDATE CASCADE ON DELETE CASCADE
"product_product_distributor_id" FOREIGN KEY (product_distributor) REFERENCES product_distributor(id)
How can I improve the performance of this query, specifically the COUNT(*) portion? Removing it speeds the query up considerably, but the count is required.

You may try using an INNER JOIN in place of the WHERE ... IN clause:
WITH selected_products AS (
    SELECT id
    FROM product
    WHERE (product_name ILIKE '%red%' OR product_color ILIKE '%red%')
      AND product_type = 1
)
SELECT product_name,
       product_color,
       (ARRAY_AGG("product_distributor"))[1] AS "product_distributor",
       (ARRAY_AGG("product_release"))[1] AS "product_release",
       COUNT(*) AS "count"
FROM product p
INNER JOIN selected_products sp ON p.id = sp.id
GROUP BY product_name,
         product_color
LIMIT 1000
OFFSET 0;
Then create an index on the "product.id" field as follows:
CREATE INDEX product_ids_idx ON product USING HASH (id);
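Both ILIKE patterns start with a wildcard, so no btree (or hash) index can serve that filter; a trigram index can. A minimal sketch, assuming the pg_trgm extension is available (the index names are made up; product_name is citext, so the index and the predicate both need a text cast to line up):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram indexes can serve ILIKE '%red%'-style predicates.
CREATE INDEX product_color_trgm_idx
    ON product USING gin (product_color gin_trgm_ops);

-- product_name is citext: index an explicit text cast, and write the
-- predicate as product_name::text ILIKE '%red%' so the expressions match.
CREATE INDEX product_name_trgm_idx
    ON product USING gin ((product_name::text) gin_trgm_ops);
Running EXPLAIN (ANALYZE, BUFFERS) on the rewritten query will show whether the planner actually uses them.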

Related

PostgreSQL add SERIAL column to existing table with values based on ORDER BY

I have a large table (6+ million rows) to which I'd like to add an auto-incrementing integer column sid, where sid is set on existing rows based on ORDER BY inserted_at ASC. In other words, the oldest record (by inserted_at) would be set to 1 and the latest record would be the total record count. Any tips on how I might approach this?
Add a sid column and UPDATE SET ... FROM ... WHERE:
UPDATE test
SET sid = t.rownum
FROM (SELECT id, row_number() OVER (ORDER BY inserted_at ASC) AS rownum
      FROM test) t
WHERE test.id = t.id;
Note that this relies on there being a primary key, id.
(If your table did not already have a primary key, you would have to make one first.)
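If you did need to add one first, a minimal sketch (assumes PostgreSQL 10+; the id values backfilled into existing rows are assigned in no guaranteed order):
-- Hypothetical: give a keyless table a surrogate primary key.
ALTER TABLE test ADD COLUMN id int PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY;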
Here is a complete worked example:
-- create test table
DROP TABLE IF EXISTS test;
CREATE TABLE test (
    id int PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
  , foo text
  , inserted_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO test (foo, inserted_at) VALUES
    ('XYZ', '2019-02-14 00:00:00-00')
  , ('DEF', '2010-02-14 00:00:00-00')
  , ('ABC', '2000-02-14 00:00:00-00');
-- +----+-----+------------------------+
-- | id | foo | inserted_at            |
-- +----+-----+------------------------+
-- |  1 | XYZ | 2019-02-13 19:00:00-05 |
-- |  2 | DEF | 2010-02-13 19:00:00-05 |
-- |  3 | ABC | 2000-02-13 19:00:00-05 |
-- +----+-----+------------------------+
ALTER TABLE test ADD COLUMN sid INT;
UPDATE test
SET sid = t.rownum
FROM (SELECT id, row_number() OVER (ORDER BY inserted_at ASC) AS rownum
      FROM test) t
WHERE test.id = t.id;
yields
+----+-----+------------------------+-----+
| id | foo | inserted_at            | sid |
+----+-----+------------------------+-----+
|  3 | ABC | 2000-02-13 19:00:00-05 |   1 |
|  2 | DEF | 2010-02-13 19:00:00-05 |   2 |
|  1 | XYZ | 2019-02-13 19:00:00-05 |   3 |
+----+-----+------------------------+-----+
Finally, make sid SERIAL (or, better, an IDENTITY column):
ALTER TABLE test ALTER COLUMN sid SET NOT NULL;
-- IDENTITY avoids certain issues that can arise with SERIAL
-- (and there is no ALTER COLUMN ... SERIAL syntax in any case):
ALTER TABLE test ALTER COLUMN sid ADD GENERATED BY DEFAULT AS IDENTITY;
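One follow-up worth noting: the sequence behind the new identity column starts at 1, so subsequent inserts would hand out sid values that duplicate existing rows. A sketch to advance it past the current maximum:
-- Advance the identity's backing sequence so new rows continue after max(sid).
SELECT setval(pg_get_serial_sequence('test', 'sid'),
              (SELECT max(sid) FROM test));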

Select row position in filtered and ordered row list PostgreSQL

I got this query,
SELECT s.pos
FROM (SELECT t.guild_id, t.user_id,
             ROW_NUMBER() OVER (ORDER BY t.reputation DESC) AS pos
      FROM users t) s
WHERE (s.guild_id, s.user_id) = ($2, $3)
that gets a user's "rank" in a guild, but I want to filter the results by entries that are in an array of t.user_id values (like {'1', '64', '83'}) and have this affect the resulting pos value. I found FILTER and WITHIN GROUP, but I'm not sure how to fit one of those into this query. How would I do that?
Here's the full table if that helps at all:
Table "public.users"
   Column   |         Type          | Collation | Nullable | Default
------------+-----------------------+-----------+----------+---------
 guild_id   | character varying(20) |           | not null |
 user_id    | character varying(20) |           | not null |
 reputation | real                  |           | not null | 0
Indexes:
"users_pkey" PRIMARY KEY, btree (guild_id, user_id)
Why not select on those first?
WITH UsersWeCareAbout AS (
    SELECT * FROM users u WHERE u.user_id = ANY(subgroup_array)
), RepUsers AS (
    SELECT t.guild_id, t.user_id, ROW_NUMBER() OVER (ORDER BY t.reputation DESC) AS pos
    FROM UsersWeCareAbout t
)
SELECT s.pos FROM RepUsers s WHERE (s.guild_id, s.user_id) = ($2, $3)
(untested if only because I didn't really have enough context to test with)
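For instance, with the question's example array inlined (in application code the list would more likely arrive as another bound parameter, e.g. ANY($1)):
WITH UsersWeCareAbout AS (
    SELECT *
    FROM users u
    WHERE u.user_id = ANY('{1,64,83}'::varchar[])  -- the filter list
), RepUsers AS (
    SELECT t.guild_id, t.user_id,
           ROW_NUMBER() OVER (ORDER BY t.reputation DESC) AS pos
    FROM UsersWeCareAbout t
)
SELECT s.pos
FROM RepUsers s
WHERE (s.guild_id, s.user_id) = ($2, $3);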

How do I update 1.3 billion rows in this table more efficiently?

I have 1.3 billion rows in a PostgreSQL table sku_comparison that looks like this:
id1 (INTEGER) | id2 (INTEGER) | (10 SMALLINT columns) | length1 (SMALLINT) | length2 (SMALLINT) | length_difference (SMALLINT)
The id1 and id2 columns reference a table called sku, which contains about 300,000 rows; each sku row has an associated varchar(25) value in a column, code.
sku_comparison has btree indexes on id1 and on id2, plus a compound index on (id1, id2). There is a btree index on the id column of sku as well.
My goal is to update the length1 and length2 columns with the lengths of the corresponding code column from the sku table. However, I ran the following code for over 20 hours, and it did not complete the update:
UPDATE sku_comparison
SET length1 = length(sku.code)
FROM sku
WHERE sku_comparison.id1 = sku.id;
All of the data is stored on a single hard disk on a local computer, and the processor is fairly modern. Constructing this table, which required much more complicated string comparisons in Python, only took about 30 hours or so, so I am not sure why something like this would take as long.
edit: here are formatted table definitions:
Table "public.sku"
   Column   |         Type          |                    Modifiers
------------+-----------------------+--------------------------------------------------
 id         | integer               | not null default nextval('sku_id_seq'::regclass)
 sku        | character varying(25) |
 pattern    | character varying(25) |
 pattern_an | character varying(25) |
 firsttwo   | character(2)          | default ' '::bpchar
 reference  | character varying(25) |
Indexes:
"sku_pkey" PRIMARY KEY, btree (id)
"sku_sku_idx" UNIQUE, btree (sku)
"sku_firstwo_idx" btree (firsttwo)
Referenced by:
TABLE "sku_comparison" CONSTRAINT "sku_comparison_id1_fkey" FOREIGN KEY (id1) REFERENCES sku(id)
TABLE "sku_comparison" CONSTRAINT "sku_comparison_id2_fkey" FOREIGN KEY (id2) REFERENCES sku(id)
Table "public.sku_comparison"
          Column           |   Type   |        Modifiers
---------------------------+----------+-------------------------
 id1                       | integer  | not null
 id2                       | integer  | not null
 consec_charmatch          | smallint |
 consec_groupmatch         | smallint |
 consec_fieldtypematch     | smallint |
 consec_groupmatch_an      | smallint |
 consec_fieldtypematch_an  | smallint |
 general_charmatch         | smallint |
 general_groupmatch        | smallint |
 general_fieldtypematch    | smallint |
 general_groupmatch_an     | smallint |
 general_fieldtypematch_an | smallint |
 length1                   | smallint | default 0
 length2                   | smallint | default 0
 length_difference         | smallint | default '-999'::integer
Indexes:
"sku_comparison_pkey" PRIMARY KEY, btree (id1, id2)
"ssd_id1_idx" btree (id1)
"ssd_id2_idx" btree (id2)
Foreign-key constraints:
"sku_comparison_id1_fkey" FOREIGN KEY (id1) REFERENCES sku(id)
"sku_comparison_id2_fkey" FOREIGN KEY (id2) REFERENCES sku(id)
Would you consider using an anonymous code block?
For example, as a PL/pgSQL DO block:
DO $$
DECLARE
    r record;
BEGIN
    -- Loop over the small sku table once, computing each length one time.
    FOR r IN SELECT id, length(code) AS code_length FROM sku
    LOOP
        UPDATE sku_comparison
        SET length1 = r.code_length
        WHERE sku_comparison.id1 = r.id;
    END LOOP;
END
$$;
This breaks the work into many small UPDATE statements and avoids evaluating the length of sku.code for every sku_comparison row. (Note that a DO block still runs as a single transaction; for genuinely smaller transactions, drive the loop from the client, or use a procedure with COMMIT on PostgreSQL 11+.)
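A hedged client-driven variant of the same idea is to sweep id1 in ranges and commit between slices (the 10,000-wide window is an arbitrary assumption to tune):
-- Run repeatedly, advancing the window and committing after each statement.
UPDATE sku_comparison sc
SET length1 = length(s.code)
FROM sku s
WHERE sc.id1 = s.id
  AND sc.id1 >= 0
  AND sc.id1 < 10000;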

How to use a subquery in Sphinx's multi-valued attributes (MVA) of the query type

My database is PostgreSQL, and I want to use the Sphinx search engine to index my data.
How can I use sql_attr_multi to fetch the relationship data?
The table schemas in PostgreSQL are:
crm=# \d orders
Table "orders"
     Column      |            Type             |               Modifiers
-----------------+-----------------------------+----------------------------------------
 id              | bigint                      | not null
 trade_id        | bigint                      | not null
 item_id         | bigint                      | not null
 price           | numeric(10,2)               | not null
 total_amount    | numeric(10,2)               | not null
 subject         | character varying(255)     | not null default ''::character varying
 status          | smallint                    | not null default 0
 created_at      | timestamp without time zone | not null default now()
 updated_at      | timestamp without time zone | not null default now()
Indexes:
"orders_pkey" PRIMARY KEY, btree (id)
"orders_trade_id_idx" btree (trade_id)
crm=# \d trades
Table "trades"
        Column         |            Type             |        Modifiers
-----------------------+-----------------------------+---------------------------------------
 id                    | bigint                      | not null
 operator_id           | bigint                      | not null
 customer_id           | bigint                      | not null
 category_ids          | bigint[]                    | not null
 total_amount          | numeric(10,2)               | not null
 discount_amount       | numeric(10,2)               | not null
 created_at            | timestamp without time zone | not null default now()
 updated_at            | timestamp without time zone | not null default now()
Indexes:
"trades_pkey" PRIMARY KEY, btree (id)
The Sphinx config is:
source trades_src
{
type = pgsql
sql_host = 10.10.10.10
sql_user = ******
sql_pass = ******
sql_db = crm
sql_port = 5432
sql_query = \
SELECT id, operator_id, customer_id, category_ids, total_amount, discount_amount, \
date_part('epoch',created_at) AS created_at, \
date_part('epoch',updated_at) AS updated_at \
FROM public.trades;
#attributes
sql_attr_bigint = operator_id
sql_attr_bigint = customer_id
sql_attr_float = total_amount
sql_attr_float = discount_amount
sql_attr_multi = bigint category_ids from field category_ids
#sql_attr_multi = bigint order_ids from query; SELECT id FROM orders
# how can I add a WHERE condition to the query for orders? e.g. WHERE trade_id = ?
sql_attr_timestamp = created_at
sql_attr_timestamp = updated_at
}
I used an MVA (multi-valued attribute) for the category_ids field, which is an ARRAY type in PostgreSQL.
But I do not know how to define an MVA for order_ids. Would it be through a subquery?
Copied from the Sphinx forum...
sql_attr_multi = bigint order_ids from query; SELECT trade_id,id FROM orders ORDER BY trade_id
The first column of the query is the Sphinx 'document_id' (i.e. the id in the main sql_query).
The second column is the value to insert into the MVA array for that document.
(The ORDER BY might not strictly be needed, but IIRC Sphinx is much quicker at processing the data if it is ordered by document_id.)
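Dropped into the trades_src source above, the attribute section would then read as follows (a sketch; Sphinx matches each MVA row to its document by that first column, so no per-document WHERE clause is needed):
sql_attr_multi = bigint category_ids from field category_ids
sql_attr_multi = bigint order_ids from query; SELECT trade_id, id FROM orders ORDER BY trade_id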

Checking fillfactor setting for tables and indexes

Is there maybe some function to check the fillfactor setting for indexes and tables? I've already tried \d+, but it gives only the basic definition, without the fillfactor value:
Index "public.tab1_pkey"
 Column |  Type  | Definition | Storage
--------+--------+------------+---------
 id     | bigint | id         | plain
primary key, btree, for table "public.tab1"
For tables I haven't found anything. If a table was created with a fillfactor other than the default:
CREATE TABLE distributors (
    did integer,
    name varchar(40),
    UNIQUE(name) WITH (fillfactor=70)
)
WITH (fillfactor=70);
Then \d+ distributors shows the non-standard fillfactor:
Table "public.distributors"
 Column |         Type          | Modifiers | Storage  | Stats target | Description
--------+-----------------------+-----------+----------+--------------+-------------
 did    | integer               |           | plain    |              |
 name   | character varying(40) |           | extended |              |
Indexes:
"distributors_name_key" UNIQUE CONSTRAINT, btree (name) WITH (fillfactor=70)
Has OIDs: no
Options: fillfactor=70
But maybe there is a way to get this value without parsing output?
You need to query the pg_class system table:
select t.relname as table_name,
       t.reloptions
from pg_class t
  join pg_namespace n on n.oid = t.relnamespace
where t.relname in ('tab1_pkey', 'tab1')
  and n.nspname = 'public';
reloptions is an array, with each element containing one option=value definition. But it will be null for relations that have the default options.
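Applied to the distributors example above, where both the table and its unique index were created with fillfactor=70, a query like this would be expected to return the option for both relations:
select t.relname, t.reloptions
from pg_class t
  join pg_namespace n on n.oid = t.relnamespace
where t.relname in ('distributors', 'distributors_name_key')
  and n.nspname = 'public';

--         relname        |   reloptions
-- -----------------------+-----------------
--  distributors          | {fillfactor=70}
--  distributors_name_key | {fillfactor=70}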