I use the pg_trgm extension to check the similarity of a text column. I want to speed it up by adding extra conditions, but without success: the speed is the same. Here is my example:
create table test (
    id serial,
    descr text,
    yesno text,
    truefalse boolean
);

insert into test
SELECT generate_series(1,1000000) AS id,
       md5(random()::text) AS descr;

update test set yesno = 'yes' where id < 500000;
update test set yesno = 'no' where id > 499999;
update test set truefalse = true where id < 100000;
update test set truefalse = false where id > 99999;

CREATE INDEX test_trgm_idx ON test USING gist (descr gist_trgm_ops);
So when I execute the query, there is no difference whether or not I use the WHERE clause.
select descr <-> '65c141ee1fdeb269d2e393cb1d3e1c09' as dist,
       descr, yesno, truefalse
from test
where yesno = 'yes'
  and truefalse = true
order by dist
limit 10;
Is this right?
After creating the test data, run ANALYZE to make sure the statistics are up to date. Then you can use EXPLAIN to find out.
On my machine it does an index scan on test_trgm_idx to read the rows in order so it can stop when the LIMIT is reached. With the WHERE clause it is actually slightly more work, because it has to scan more rows before the limit is reached, though the time difference is not noticeable.
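For reference, the check could look like this, using the table and query from the question:

ANALYZE test;

EXPLAIN ANALYZE
select descr <-> '65c141ee1fdeb269d2e393cb1d3e1c09' as dist,
       descr, yesno, truefalse
from test
where yesno = 'yes'
  and truefalse = true
order by dist
limit 10;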
I have a database table where I want to simply update a column's value for all records that have a certain value in another column. I can't seem to figure out how to get batch updates working using postgres queries.
The current table structure is like the following:
placed_orders (
    order_id uuid,
    source char(1),    -- 'A', 'B', or 'C'
    submitted char(1), -- 'Y' or 'N'
    ...
)
I want to simply update the source of all placed_orders rows to 'B' where submitted is 'Y', which usually would be as simple as running:
UPDATE placed_orders SET source='B' WHERE submitted='Y' AND source!='B';
The table is pretty big, with over 5 million records. If I just run the above query to update all records at once, I run into out-of-memory errors, so I'm now looking into updating in batches.
The query below is what I have so far; I'm new to loops, declaring variables, and changing their values:
DO $test$
DECLARE
    batch_size int := 1000;
    current_offset int := 0;
    records_to_update int := 0;
BEGIN
    SELECT COUNT(1) INTO records_to_update
    FROM placed_orders
    WHERE submitted = 'Y' AND source != 'B';

    FOR j IN 0..records_to_update BY batch_size LOOP
        UPDATE placed_orders
        SET source = 'B'
        WHERE order_id IN (
            SELECT order_id
            FROM placed_orders
            WHERE submitted = 'Y' AND source != 'B'
            ORDER BY order_id ASC
            OFFSET current_offset
            LIMIT batch_size);
        current_offset := current_offset + batch_size;
    END LOOP;
END $test$;
When running this query, I still get the memory error. Does anyone know how to get it to update in true batches?
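For what it's worth, one common pattern is to loop until no rows are left to update and commit between batches, instead of paginating with OFFSET (rows that have already been updated no longer match the filter, so the OFFSET ends up skipping rows). A minimal sketch, assuming PostgreSQL 11+ so the DO block may COMMIT, run outside an explicit transaction:

DO $$
DECLARE
    batch_size int := 10000;
    rows_updated int;
BEGIN
    LOOP
        UPDATE placed_orders
        SET source = 'B'
        WHERE order_id IN (
            SELECT order_id
            FROM placed_orders
            WHERE submitted = 'Y' AND source != 'B'
            LIMIT batch_size);
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
        COMMIT;  -- end the current batch before starting the next one
    END LOOP;
END $$;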
SUMMARY: I have two tables I want to derive info from: family_values (family_name, item_regex) and product_ids (product_id), to be able to update the property family_name in a third.
Here the plan is to grab a json array from the small family_values table and use the column value item_regex to do a test match against the product_id for every row in product_ids.
MORE DETAILS: I'm importing static data from CSV into a table of orders. But in evaluating cost of goods and market value, I continually need to determine the family from a prefix regex (item_regex from family_values) match on the product_id.
On the client this looks like this:
const families = {
FOOBAR: 'Big Ogre',
FOOBA: 'Wood Elf',
FOO: 'Valkyrie'
};
// And to find family, and subsequently COGs and Market Value:
const findFamily = product_id => Object.keys(families).find(f => new RegExp('^' + f).test(product_id));
This is a huge hit for the client, so I made a family_values table in PG with the columns family_name, item_regex, cogs, market_value.
Then product_ids has a list of only the products the app cares about (out of millions). This is actually used with a BEFORE INSERT trigger to ignore any CSV entries that aren't in the product_ids view. So after that, the product_ids view could probably be taken out of the equation, because the orders table, after inserting the read-only data, has its own matching product_id. It does NOT have family_name, so I still have the issue of determining that client-side.
PSEUDO CODE: update the family column of orders with family_name from family_values, regex-matched against orders.product_id
OR update the product_ids table with a new family column and use that with the existing ON INSERT trigger (currently used to left-pad zeros and normalize data). Now I'm thinking this may just be an UPDATE as suggested, but I'm not very good with regex in PG. I'm a PG novice.
PROBLEM: But I'm having a hangup doing what I thought would be like a JS Array find operation. The family_values rows have been sorted on item_regex so that the strictest match is on top, and is therefore found first.
For example, with sorting we have:
family_values_array = [
{"family_name": "Big Ogre", "item_regex": "FOOBAR"},
{"family_name": "Wood Elf", "item_regex": "FOOBA"},
{"family_name": "Valkyrie", "item_regex": "FOO"}]
So the comparison of a product_id against ^FOOBA would yield the family "Wood Elf".
SOLUTION:
The solution I finally ended up using was simply using concat to write out the front-anchored match. It was so simple in the end. The key line I was missing is:
select * into family_value_row
from iol.family_values
where lvl3_id = product_row.lvl3_id
  and product_row.product_id like concat(item_regex, '%')
limit 1;
Whole function:
create or replace function iol.populate_families() returns void as $$
declare
    product_row record;
    family_value_row record;
begin
    for product_row in
        select product_id, lvl3_id from iol.products
    loop
        -- family_name is what we want after finding the best match for a product_id against item_regex
        select * into family_value_row
        from iol.family_values
        where lvl3_id = product_row.lvl3_id
          and product_row.product_id like concat(item_regex, '%')
        limit 1;

        -- update family_name and value columns
        update iol.products set
            family_name = family_value_row.family_name,
            cog_cents = family_value_row.cog_cents,
            market_value_cents = family_value_row.market_value_cents
        where product_id = product_row.product_id;
    end loop;
end;
$$ language plpgsql;
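As an aside, the same prefix match can be done in a single set-based UPDATE instead of a row-by-row loop. A hedged sketch using the tables and columns from the function above (DISTINCT ON picks one match per product, preferring the longest, i.e. strictest, prefix):

update iol.products p
set family_name        = fv.family_name,
    cog_cents          = fv.cog_cents,
    market_value_cents = fv.market_value_cents
from (
    select distinct on (pr.product_id)
           pr.product_id, f.family_name, f.cog_cents, f.market_value_cents
    from iol.products pr
    join iol.family_values f
      on f.lvl3_id = pr.lvl3_id
     and pr.product_id like concat(f.item_regex, '%')
    order by pr.product_id, length(f.item_regex) desc  -- strictest (longest) prefix wins
) fv
where p.product_id = fv.product_id;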
I want to dynamically filter data based on a condition that is stored in a specific column. The condition can change for every row.
For example, I have a table my_table with a couple of columns. One of them is called foo and contains filter conditions such as AND bar > 1, or in the next row AND bar > 2, or in the next row AND bar = 33.
I have a query which looks like:
SELECT something from somewhere
LEFT JOIN otherthing on some_condition
WHERE first_condition AND second_condition AND
here_i_want_dynamically_load_condition_from_my_table.foo
What is the correct way to do it? I have read some articles about dynamic queries, but I am not able to find a correct way.
This is impossible in pure SQL: at query time, the planner has to know your exact logic. Now, you can hide it away in a function (in pseudo-sql):
CREATE FUNCTION do_I_filter_or_not(some_id integer) RETURNS boolean AS $$
DECLARE
    value integer;
    condition_type text;
    condition_value integer;
BEGIN
    SELECT some_value INTO value FROM some_table WHERE id = some_id;
    SELECT ... INTO condition_type;   -- query a condition type for this row
    SELECT ... INTO condition_value;  -- query a condition value for this row

    IF condition_type = 'equals' AND condition_value = value THEN
        RETURN true;
    END IF;
    IF condition_type = 'greater_than' AND condition_value < value THEN
        RETURN true;
    END IF;
    IF condition_type = 'lower_than' AND condition_value > value THEN
        RETURN true;
    END IF;
    RETURN false;
END;
$$ LANGUAGE plpgsql;
And query it like this:
SELECT something
FROM somewhere
LEFT JOIN otherthing on some_condition
WHERE first_condition
AND second_condition
AND do_I_filter_or_not(somewhere.id)
Now the performance will be bad: you have to invoke that function on potentially every row in the query, triggering lots of subqueries.
Thinking about it, if you just want <, >, =, and you have a table (filter_criteria) describing, for each id, what the criterion is, you can do it:
CREATE TABLE filter_criteria (
    some_id integer,
    equals_threshold integer,
    greater_than_threshold integer,
    lower_than_threshold integer
    -- plus a CHECK constraint that exactly one threshold is not null
);

INSERT INTO filter_criteria VALUES (1, null, 5, null); -- for > 5
And query like this:
SELECT something
FROM somewhere
LEFT JOIN otherthing on some_condition
LEFT JOIN filter_criteria USING (some_id)
WHERE first_condition
AND second_condition
AND COALESCE(bar = equals_threshold, true)
AND COALESCE(bar > greater_than_threshold, true)
AND COALESCE(bar < lower_than_threshold, true)
The COALESCEs are here to default to not filtering (AND true) if the threshold is missing (bar = equals_threshold will yield null instead of a boolean).
The planner now knows your exact logic at query time: you're just doing three filtering checks (=, <, >) per row. That should still be more performant than idea #1 with all the subquerying.
I'm using PostgreSQL 9.2.9 and have the following problem.
There is a function:
CREATE OR REPLACE FUNCTION report_children_without_place(text, date, date, integer)
RETURNS TABLE (department_name character varying, kindergarten_name character varying, a1 bigint) AS $BODY$
BEGIN
RETURN QUERY WITH rh AS (
SELECT (array_agg(status ORDER BY date DESC))[1] AS status, request
FROM requeststatushistory
WHERE date <= $3
GROUP BY request
)
SELECT
w.name,
kgn.name,
COUNT(*)
FROM kindergarten_request_table_materialized kr
JOIN rh ON rh.request = kr.id
JOIN requeststatuses s ON s.id = rh.status AND s.sysname IN ('confirmed', 'need_meet_completion', 'kindergarten_need_meet')
JOIN workareas kgn ON kr.kindergarten = kgn.id AND kgn.tree <# CAST($1 AS LTREE) AND kgn.active
JOIN organizationforms of ON of.id = kgn.organizationform AND of.sysname IN ('state','municipal','departmental')
JOIN workareas w ON w.tree #> kgn.tree AND w.active
JOIN workareatypes mt ON mt.id = w.type AND mt.sysname = 'management'
WHERE kr.requestyear = $4
GROUP BY kgn.name, w.name
ORDER BY w.name, kgn.name;
END
$BODY$ LANGUAGE PLPGSQL STABLE;
EXPLAIN ANALYZE SELECT * FROM report_children_without_place('83.86443.86445', '14-04-2015', '14-04-2015', 2014);
Total runtime: 242805.085 ms.
But the query from the function's body executes much faster:
EXPLAIN ANALYZE WITH rh AS (
SELECT (array_agg(status ORDER BY date DESC))[1] AS status, request
FROM requeststatushistory
WHERE date <= '14-04-2015'
GROUP BY request
)
SELECT
w.name,
kgn.name,
COUNT(*)
FROM kindergarten_request_table_materialized kr
JOIN rh ON rh.request = kr.id
JOIN requeststatuses s ON s.id = rh.status AND s.sysname IN ('confirmed', 'need_meet_completion', 'kindergarten_need_meet')
JOIN workareas kgn ON kr.kindergarten = kgn.id AND kgn.tree <# CAST('83.86443.86445' AS LTREE) AND kgn.active
JOIN organizationforms of ON of.id = kgn.organizationform AND of.sysname IN ('state','municipal','departmental')
JOIN workareas w ON w.tree #> kgn.tree AND w.active
JOIN workareatypes mt ON mt.id = w.type AND mt.sysname = 'management'
WHERE kr.requestyear = 2014
GROUP BY kgn.name, w.name
ORDER BY w.name, kgn.name;
Total runtime: 2156.740 ms.
Why does the function execute so much longer than the same query? Thanks.
Your query runs faster because the "variables" are not actually variable -- they are static values (i.e. strings in quotes). This means the execution planner can leverage indexes. Within your stored procedure, your variables are actual variables, and the planner cannot make assumptions about indexes. For example, you might have a partial index on requeststatushistory where "date" is <= '2012-12-31'. The index can only be used if $3 is known. Since it might hold a date from 2015, the partial index would be of no use. In fact, it would be detrimental.
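A hypothetical partial index of the kind described above (the index name and column list are just for illustration) might look like this:

-- only usable when the planner knows the date bound falls inside the predicate
CREATE INDEX requeststatushistory_2012_idx
    ON requeststatushistory (request, date)
    WHERE date <= '2012-12-31';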
I frequently construct a string within my functions, concatenating my variables in as literals, and then execute it using something like the following:
DECLARE
my_dynamic_sql TEXT;
BEGIN
my_dynamic_sql := $$
SELECT *
FROM my_table
WHERE $$ || quote_literal($3) || $$::TIMESTAMPTZ BETWEEN start_time
AND end_time;$$;
/* You can only see this if client_min_messages = DEBUG */
RAISE DEBUG '%', my_dynamic_sql;
RETURN QUERY EXECUTE my_dynamic_sql;
END;
The dynamic SQL is VERY useful because you can actually get an EXPLAIN of the query: with client_min_messages = DEBUG set, you can scrape the query from the screen and paste it back in after EXPLAIN or EXPLAIN ANALYZE and see what the execution planner is doing. This also allows you to construct very different queries as needed to optimize for the variables (i.e. exclude unnecessary tables if warranted) and maintain a common API for your clients.
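A short usage sketch of that debugging trick (my_report is a hypothetical wrapper function standing in for whatever function builds the dynamic SQL above):

SET client_min_messages = DEBUG;
SELECT * FROM my_report('2015-04-14');  -- the RAISE DEBUG line prints the generated SQL
-- copy the printed statement and run it under EXPLAIN ANALYZE to inspect the plan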
You may be tempted to avoid the dynamic SQL for fear of performance issues (I was at first), but you will be amazed at how LITTLE time is spent in planning compared to the cost of a couple of table scans in your seven-table join!
Good luck!
Follow-up: You might experiment with Common Table Expressions (CTEs) for performance as well. If you have a table that has a low signal-to-noise ratio (has many, many more records in it than you actually want to return) then a CTE can be very helpful. PostgreSQL executes CTEs early in the query, and materializes the resulting rows in memory. This allows you to use the same result set multiple times and in multiple places in your query. The benefit can really be surprising if you design it correctly.
sql_txt := $$
    WITH my_cte as (
        select fk1 as moar_data1
             , field1
             , field2 /* do not need all other fields taking up RAM! */
        from my_table
        where field3 between $$ || quote_literal(input_start_ts) || $$::timestamptz
                         and $$ || quote_literal(input_end_ts) || $$::timestamptz
    ),
    keys_cte as (
        select key_field
        from big_look_up_table
        where look_up_name = ANY($$ || QUOTE_LITERAL(input_array_of_names) || $$::VARCHAR[])
    )
    SELECT field1, field2, moar_data1, moar_data2
    FROM moar_data_table
    INNER JOIN my_cte USING (moar_data1)
    WHERE moar_data_table.moar_data_key in (select key_field from keys_cte) $$;
An execution plan is likely to show that it chooses to use an index on moar_data_table.moar_data_key. This would appear to go against what I said above in my prior answer -- except for the fact that the keys_cte results are materialized (and therefore cannot be changed by another transaction in a race condition): you have your own little copy of the data for use in this query.
Oh - and CTEs can use other CTEs that are declared earlier in the same query. I have used this "trick" to replace sub-queries in very complex joins and seen great improvements.
Happy Hacking!
I have a large database in which I want to apply some logic to update new fields.
The primary key of the table harvard_assignees is id.
The LOGIC GOES LIKE THIS
Select all of the records based on id
For each record (WHILE), if (state is NOT NULL && country is NULL), update country_out = "US" ELSE update country_out=country
I see step 1 as a PostgreSQL query and step 2 as a function. Just trying to figure out the easiest way to implement natively with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
Find all DISTINCT foreign_keys (a bivariate key of pat_type,patent)
Count Records that contain that value (e.g., n=3 records have fkey "D","388585")
Update those 3 records to identify percent as 1/n (e.g., UPDATE 3 records, set percent = 1/3)
For the first one:
UPDATE
harvard_assignees
SET
country_out = (CASE
WHEN (state is NOT NULL AND country is NULL) THEN 'US'
ELSE country
END);
At first it had condition "id = ..." but I removed that because I believe you actually want to update all records.
And for the second one:
UPDATE
  example_table
SET
  -- 1.0 forces numeric division; plain 1/cnt would truncate to 0 for cnt > 1
  percent = (SELECT 1.0 / cnt
             FROM (SELECT count(*) AS cnt
                   FROM example_table AS x
                   WHERE x.fn_key_1 = example_table.fn_key_1
                     AND x.fn_key_2 = example_table.fn_key_2) AS tmp
             WHERE cnt > 0);
That one will be kind of slow though.
I'm thinking of a solution based on window functions; you may want to explore those too.
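A hedged sketch of that window-function idea (it joins back on ctid, since example_table's unique key isn't shown here):

UPDATE example_table AS t
SET percent = s.pct
FROM (
    SELECT ctid,
           1.0 / count(*) OVER (PARTITION BY fn_key_1, fn_key_2) AS pct
    FROM example_table
) AS s
WHERE t.ctid = s.ctid;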