Optimising computing time on sql query - postgresql

I usually use the update query to change or update a column in my PostgreSQL database.
I create subqueries from the data in a second schema to integrate it with the column to be updated.
In this code I update the 'unit_source_concept_id' column of the measurement table in the OMOP CDM schema from a table in another schema (the 'transform_semantique' schema).
with subquery1 as (
select unit_source_concept_id
from transform_semantique.measurement )
update omop.measurement m
set unit_source_concept_id = subquery1.unit_source_concept_id
from subquery1
where measurement_source_concept_id <> 5;
I don't know whether an UPDATE query is the most suitable and efficient approach in terms of computing time. This query took more than 6000 s to execute.
Do you know a method to optimise this query?
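One thing that stands out in the query above is that nothing links subquery1 to the row being updated. Without a join condition, PostgreSQL pairs every target row with every source row, and which source value ends up in each row is not predictable. A minimal sketch of a keyed update follows. It assumes that both tables share the key column measurement_id (the primary key of measurement in the OMOP CDM); that column is not shown in the question, so adjust it to whatever actually links the two tables.

```sql
-- Assumption: transform_semantique.measurement and omop.measurement
-- both carry measurement_id and it identifies the same row in each.
UPDATE omop.measurement AS m
SET unit_source_concept_id = s.unit_source_concept_id
FROM transform_semantique.measurement AS s
WHERE m.measurement_id = s.measurement_id
  AND m.measurement_source_concept_id <> 5
  -- Skip rows that already hold the right value, to avoid needless writes.
  AND m.unit_source_concept_id IS DISTINCT FROM s.unit_source_concept_id;
```

Running the statement with EXPLAIN in front of it first shows whether the planner joins the two tables on the key rather than building a cross product.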

Related

create 2 indexes on same column

I have a table with geometry column.
I have 2 indexes on this column:
create index idg1 on tbl using gist(geom)
create index idg2 on tbl using gist(st_geomfromewkb((geom)::bytea))
I have a lot of queries using the geom (geometry) field.
Which index is used (when and why)?
If there are two indexes on the same column (as shown here), can SELECT queries run slower than with just one index on the column?
The use of an index depends on how the index was defined, and how the query is invoked. If you SELECT <cols> FROM tbl WHERE geom = <some_value>, then you will use the idg1 index. If you SELECT <cols> FROM tbl WHERE st_geomfromewkb(geom) = <some_value>, then you will use the idg2 index.
A good way to know which index will be used for a particular query is to call the query with EXPLAIN (i.e., EXPLAIN SELECT <cols> FROM tbl WHERE geom = <some_value>) -- this will print out the query plan, which access methods, which indexes, which joins, etc. will be used.
For your question regarding performance, the SELECT queries could run slower because there are more indexes to consider in the query planning phase. In terms of executing a given query plan, a SELECT query will not run slower because by then the query plan has been established and the decision of which index to use has been made.
You will certainly experience performance impact upon INSERT/UPDATE/DELETE of the table, as all indexes will need to be updated with respect to the changes in the table. As such, there will be extra I/O activity on disk to propagate the changes, slowing down the database, especially at scale.
Which index is used depends on the query.
Any query that has
WHERE geom && '...'::geometry
or
WHERE st_intersects(geom, '...'::geometry)
or similar will use the first index.
The second index will only be used for queries that have the expression st_geomfromewkb((geom)::bytea) in them.
This is completely useless: it converts the geometry to EWKB format and back. You should find and rewrite all queries that have this weird construct, then you should drop that index.
Having two indexes on a single column does not slow down your queries significantly (planning will take a bit longer, but I doubt if you can measure that). You will have a performance penalty for every data modification though, which will take almost twice as long as with a single index.
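To check this on your own data, something along these lines may help. It is only a sketch: the envelope coordinates and SRID 4326 are placeholder assumptions, and on a small table the planner may choose a sequential scan regardless of the indexes.

```sql
-- Show the plan for a typical bounding-box query; an index scan or
-- bitmap index scan on idg1 is what you would hope to see here.
EXPLAIN
SELECT *
FROM tbl
WHERE geom && ST_MakeEnvelope(0, 0, 10, 10, 4326);

-- How often each index on tbl has been scanned since the statistics were last reset.
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'tbl';

-- Only once no query uses the st_geomfromewkb(...) expression any more:
DROP INDEX idg2;
```

If idx_scan for idg2 stays at zero over a representative period, that is a reasonable hint that nothing depends on it.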

Redshift large 'in' clause best practices

We have a query in which a list of parameter values is provided in the "IN" clause. Some time ago this query failed to execute because the list in the "IN" clause had grown so large that the resulting statement exceeded Redshift's 16 MB query size limit. We then tried processing the data in batches to stay under the 16 MB limit.
My question is: what factors or pitfalls should I keep in mind when supplying such a large list to the "IN" clause of a query, and is there an alternative way to deal with this much data?
If you have control over how you are generating your code, you could split it up as follows
The first statements to submit drop and recreate the filter table:
drop table if exists myfilter;
create table myfilter (filter_text varchar(max));
The second step is to populate the filter table in batches of a suitable size, e.g. 1000 values at a time. Each value needs its own parenthesised row, because the table has a single column:
insert into myfilter
values ({{myvalue1}}), ({{myvalue2}}), ({{myvalue3}}) /* ... and so on, up to 1000 values */;
Repeat the above step until all of your values have been inserted.
Then use that filter table as follows:
select * from master_table
where some_value in (select filter_text from myfilter);
drop table myfilter;
A large IN list is not good practice in itself; it's better to use a join for large lists:
construct a virtual table as a subquery
join your target table to the virtual table
like this:
with
your_list as (
select 'first_value' as search_value
union select 'second_value'
...
)
select ...
from target_table t1
join your_list t2
on t1.col=t2.search_value

Unable to optimise Redshift query

I have built a system where data is loaded from S3 into Redshift every few minutes (from a Kinesis Firehose). I then grab data from that main table and split it into a table per customer.
The main table has a few hundred million rows.
Creating the subtable is done with a query like this:
create table {$table} as select * from {$source_table} where customer_id = '{$customer_id}' and time between {$start} and {$end}
I have keys defined as:
SORTKEY (customer_id, time)
DISTKEY (customer_id)
Everything I have read suggests this would be the optimal way to structure my tables and queries, but the performance is absolutely awful. Building the subtables takes over a minute even with only a few rows to select.
Am I missing something or do I just need to scale the cluster?
If you do not have a better key you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key you're forcing all work to be done on just one slice.
To see this in action look in the system tables. First, find an example query:
SELECT *
FROM stl_query
WHERE userid > 1
ORDER BY starttime DESC
LIMIT 10;
Then, look at the bytes per slice for each step of your query in svl_query_report:
SELECT *
FROM svl_query_report
WHERE query = <your query id>
ORDER BY query,segment,step,slice;
For a very detailed guide on designing the best table structure, have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook".
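As a rough illustration of the DISTSTYLE EVEN suggestion above, the main table could be declared along these lines. The table name events_main, the payload column, and the column types are hypothetical, since the question does not show the actual schema; only customer_id and time come from the question.

```sql
-- Hypothetical table definition; only customer_id and "time" are from the question.
CREATE TABLE events_main (
    customer_id VARCHAR(64) NOT NULL,
    "time"      TIMESTAMP   NOT NULL,  -- quoted to avoid any clash with the TIME type keyword
    payload     VARCHAR(MAX)
)
DISTSTYLE EVEN                          -- spread rows round-robin across all slices
COMPOUND SORTKEY (customer_id, "time"); -- same sort key as before

-- Check how evenly the rows ended up distributed.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'events_main';
```

A skew_rows value close to 1 indicates that the slices hold similar numbers of rows.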

Update from existing table in Redshift

I would like to update a value in a Redshift table from the results of another table. I'm trying to run the following query but I receive an error.
update section_translate
set word=t.section_type
from (
select distinct section_type from mr_usage where section_type like '%sディスコ')t
where word = '80sディスコ'
The error I received:
ERROR: Target table must be part of an equijoin predicate
I can't understand what is wrong with my query.
You need to turn the uncorrelated subquery into a correlated one:
update section_translate
set word=t.section_type
from (
select distinct section_type,'80sディスコ' as word from mr_usage where section_type like '%sディスコ')t
where section_translate.word = t.word
Otherwise, every record of the outer query is eligible for the update, and the query engine rejects it. The way PostgreSQL (and thus Redshift) evaluates uncorrelated subqueries is slightly different from SQL Server, Oracle, etc.

Optimizing sybase SQL Query

Is there a way to optimize the following query:
UPDATE myTable
SET Calculation =
(SELECT MAX(Calculation)
FROM myTable T
WHERE T.Id = myTable.Id
AND T.Flag='N')
WHERE Calculation='NA'
AND Flag='Y'
where myTable has approx. 4 million rows? Actually, the first non-NULL value would do the job (Sybase ASE 15.0.2).
Check the query plan: it is most likely using a deferred update, which takes longer than a direct update.
The query suggested by Michael should perform better.
Remember that the following require a deferred update:
Updates that use self-joins
Updates to columns used for self-referential integrity
Updates to a table referenced in a correlated subquery
Thanks..
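To check the update mode without actually running the statement, a session along these lines can be used in isql. This is a sketch that reuses the query from the question unchanged; in the showplan output, look for the line reporting whether the update mode is direct or deferred.

```sql
-- Print the plan for each statement, but do not execute it.
set showplan on
go
set noexec on
go
UPDATE myTable
SET Calculation =
    (SELECT MAX(Calculation)
     FROM myTable T
     WHERE T.Id = myTable.Id
       AND T.Flag = 'N')
WHERE Calculation = 'NA'
  AND Flag = 'Y'
go
-- Turn noexec off first, otherwise the following set command is not executed either.
set noexec off
go
set showplan off
go
```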