PostgreSQL WHERE IN ILIKE query - postgresql

I was wondering if it is possible to improve the performance of the query below, which is taking longer than the response timeout:
SELECT * FROM products p
WHERE lower(name) ILIKE 'BL%'
OR lower(name) ILIKE 'Bule%'
AND p.site_id = 123
AND p.product_type = 0
ORDER BY external_id ASC LIMIT 25;
Previously it worked fine, when there was less data in the DB. (Using PostgreSQL)
==> name varchar(255)
I tried SIMILAR TO and the ANY operator, but those neither return the same records nor perform better.
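One approach that often helps ILIKE prefix searches (not from the original post, just a sketch) is a trigram index from the pg_trgm extension; the index name below is illustrative. Note that ILIKE is already case-insensitive, so lower() is unnecessary, and parentheses are added around the OR because AND otherwise binds tighter than OR:
-- Sketch only: pg_trgm GIN indexes can serve ILIKE '...%' patterns.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX products_name_trgm_idx ON products USING gin (name gin_trgm_ops);

SELECT *
FROM products p
WHERE (p.name ILIKE 'BL%' OR p.name ILIKE 'Bule%')
  AND p.site_id = 123
  AND p.product_type = 0
ORDER BY p.external_id ASC
LIMIT 25;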

Related

PostgreSQL: optimization of query for select with overlap

I created the following query to select data with overlapping periods (for campaigns which have the same business identifier):
select
    campaign_instance_1.campaign_id,
    campaign_instance_1.start_time
from campaign_instance as campaign_instance_1
inner join campaign_instance as campaign_instance_2
    on campaign_instance_1.campaign_id = campaign_instance_2.campaign_id
    and (
        (campaign_instance_1.start_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
        or (campaign_instance_1.finish_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
        or (campaign_instance_1.start_time < campaign_instance_2.start_time and campaign_instance_1.finish_time > campaign_instance_2.finish_time)
        or (campaign_instance_1.start_time > campaign_instance_2.start_time and campaign_instance_1.finish_time < campaign_instance_2.finish_time))
With index, created as:
CREATE INDEX IF NOT EXISTS camp_inst_idx_campaign_id_and_finish_time
ON public.campaign_instance_without_index USING btree
(campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST)
TABLESPACE pg_default;
Already at 100,000 rows it runs very slowly: 43 seconds!
For optimization I tried adding start_time to the index:
(campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST, start_time DESC NULLS LAST)
But result is the same.
As I understand the EXPLAIN ANALYZE output, "start_time" is not used as an index condition:
I tried the query with this index on both 10,000 and 100,000 rows, so it apparently does not depend on sample size (at least at these scales).
The source table has the following structure:
campaign_id bigint,
fire_time bigint,
start_time bigint,
finish_time bigint,
recap character varying,
details json
Why is my index not used, and what are possible ways to improve the query?
Joining campaign_instance to itself doesn't really serve any purpose here other than an "existence" check, and presumably your intention is not to get back duplicate rows for matching records. You could therefore simplify the query with EXISTS or a LATERAL join. Your join condition on time can also be simplified; you seem to be looking for overlapping times:
select campaign_id, start_time
from campaign_instance c1
where exists (select *
              from campaign_instance c2
              where c1.campaign_id = c2.campaign_id
                and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));
The time overlap should probably use < and > instead of <= and >=, but I don't know your exact requirements; BETWEEN implicitly means <= and >=.
EDIT: Ensure that the match is not the row itself:
(This table should have a primary key to make things easier, but as it doesn't, I would assume that there is no duplication on campaign_id, start_time and finish_time and that could be used as a composite key)
select campaign_id, start_time
from campaign_instance c1
where exists (select *
              from campaign_instance c2
              where c1.campaign_id = c2.campaign_id
                and (c1.start_time != c2.start_time or c1.finish_time != c2.finish_time)
                and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));
This takes around 230-250 milliseconds on my system (iMac i5 7500, 3.4 GHz, 64 GB RAM).
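A further untested sketch: an index covering the join and time columns might help the correlated EXISTS lookup. The index name and column order below are assumptions, not something taken from the answer above.
-- Hypothetical supporting index; name and column order are assumptions.
CREATE INDEX IF NOT EXISTS camp_inst_idx_campaign_times
    ON campaign_instance (campaign_id, start_time, finish_time);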

Nested SQL Query Optimization

Is there a better way to write this query so that it performs better?
SELECT * FROM data d
WHERE d.id IN (SELECT max(d1.id)
FROM data d1
WHERE d1.name='A'
AND d1.version='2')
I am not so good with SQL.
With PostgreSQL v13, you can do it like this:
SELECT * FROM data
WHERE name = 'A'
AND version = '2'
ORDER BY id DESC
FETCH FIRST 1 ROWS WITH TIES;
That will give you all rows where id is the maximum.
If id is unique, you can use FETCH FIRST 1 ROWS ONLY or LIMIT 1, which will also work with older PostgreSQL versions.
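For illustration, the LIMIT variant mentioned above would look like this (same filter; this also works on older PostgreSQL versions):
SELECT * FROM data
WHERE name = 'A'
  AND version = '2'
ORDER BY id DESC
LIMIT 1;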
Apart from the other answers, which are equally interesting and correct: IN is typically a non-performant keyword. You can remove it by writing your query in a slightly different way:
SELECT * FROM data d
WHERE d.name = 'A' and d.version = '2' and
d.id = (SELECT max(d1.id) FROM data d1 WHERE d1.name='A' AND d1.version='2')
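Either formulation should benefit from an index on the filter columns; a possible (untested) definition with an illustrative name:
-- Hypothetical index; lets the max(id) subquery be answered from the index.
CREATE INDEX data_name_version_id_idx ON data (name, version, id);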

Cassandra filter with ordering query modeling

I am new to Cassandra and I am trying to model a table in Cassandra. My queries look like the following
Query #1: select * from TableA where Id = "123"
Query #2: select * from TableA where name="test" order by startTime DESC
Query #3: select * from TableA where state="running" order by startTime DESC
I have been able to build the table for Query #1 which looks like
val tableAStatement = SchemaBuilder.createTable("tableA").ifNotExists.
addPartitionKey(Id, DataType.uuid).
addColumn(Name, DataType.text).
addColumn(StartTime, DataType.timestamp).
addColumn(EndTime, DataType.timestamp).
addColumn(State, DataType.text)
session.execute(tableAStatement)
but for Query #2 and Query #3 I have tried many different things and failed. Every time I get stuck on a different error from Cassandra.
Considering the above queries, what would be the right table model? What is the right way to model such queries?
Query #2: select * from TableB where name="test"
CREATE TABLE TableB (
name text,
start_time timestamp,
PRIMARY KEY (name, start_time)
) WITH CLUSTERING ORDER BY (start_time DESC)
Query #3: select * from TableC where state="running"
CREATE TABLE TableC (
state text,
start_time timestamp,
PRIMARY KEY (state, start_time)
) WITH CLUSTERING ORDER BY (start_time DESC)
In Cassandra you model your tables around your queries. Data denormalization and duplication are expected. Notice the clustering order: this way you can omit the ORDER BY in your query.
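For example (a sketch against the tables above), the queries then become plain partition lookups and the rows come back already sorted by start_time, newest first:
SELECT * FROM TableB WHERE name = 'test';
SELECT * FROM TableC WHERE state = 'running';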

Postgresql OR conditions with an empty subquery

How can I optimize a query whose WHERE conditions include a check for user_id = X OR user_id IN (some subquery that might return no results)?
In my example below, queries 1 and 2 are both extremely fast (< 1 ms), but query 3, which is simply an OR of the conditions in queries 1 and 2, is much slower (50 ms).
Can somebody please explain why query 3 is so slow, and in general what types of query optimization strategies I should pursue to avoid this problem? I realize the subquery in my example could easily be eliminated, but in real life subqueries sometimes seem like the least complicated way to get the data I want.
relevant code and data:
posts data
https://dl.dropbox.com/u/4597000/StackOverflow/sanitized_posts.csv
users data
https://dl.dropbox.com/u/4597000/StackOverflow/sanitized_users.csv
# from the shell:
# > createdb test
CREATE TABLE posts (
id integer PRIMARY KEY NOT NULL,
created_by_id integer NOT NULL,
created_at integer NOT NULL
);
CREATE INDEX index_posts ON posts (created_by_id, created_at);
CREATE INDEX index_posts_2 ON posts (created_at);
CREATE TABLE users (
id integer PRIMARY KEY NOT NULL,
login varchar(50) NOT NULL
);
CREATE INDEX index_users ON users (login);
COPY posts FROM '/path/to/sanitized_posts.csv' DELIMITERS ',' CSV;
COPY users FROM '/path/to/sanitized_users.csv' DELIMITERS ',' CSV;
-- queries:
-- query 1, fast:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id = 123 LIMIT 100;
-- query 2, fast:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id IN (SELECT id FROM users WHERE login = 'nobodyhasthislogin') LIMIT 100;
-- query 3, slow:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id = 123 OR created_by_id IN (SELECT id FROM users WHERE login = 'nobodyhasthislogin') LIMIT 100;
Split the query (edited):
SELECT * FROM (
    SELECT * FROM posts p WHERE p.created_by_id = 123
    UNION
    SELECT * FROM posts p
    WHERE EXISTS (SELECT TRUE FROM users WHERE id = p.created_by_id AND login = 'nobodyhasthislogin')
) p
LIMIT 100;
How about:
EXPLAIN ANALYZE
SELECT *
FROM posts
WHERE created_by_id IN (
    SELECT 123
    UNION ALL
    SELECT id FROM users WHERE login = 'nobodyhasthislogin')
LIMIT 100;
Most of the time in this particular query is spent in an index scan. Here is a query that goes at it from a different angle to avoid this, but it should return equivalent results.
SELECT posts.* FROM users JOIN posts ON posts.created_by_id = users.id WHERE users.id = 123 OR login = 'nobodyhasthislogin';
This selects from the users table, doing the filter once, and then joins posts onto that.
I realize that the question is about tips for optimization, not really about this specific query. To answer that, my suggestion is to run EXPLAIN ANALYZE and read up on interpreting the results; this answer was helpful to me.
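For instance (a sketch combining the join above with the question's LIMIT), you could compare plans with:
-- Sketch: EXPLAIN ANALYZE on the join-based rewrite above.
EXPLAIN ANALYZE
SELECT posts.*
FROM users
JOIN posts ON posts.created_by_id = users.id
WHERE users.id = 123 OR users.login = 'nobodyhasthislogin'
LIMIT 100;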

Table cardinality information in Firebird

I am new to Firebird and I am messing around in its metadata to get some information about the table structure, etc.
My problem is that I can't seem to find some information about estimated table cardinality. Is there a way to get this information from Firebird?
Edit:
By cardinality I mean the number of rows in a table, and for my use case SELECT COUNT(*) is not an option.
You can use an approximate method based on the selectivity of the primary key (RDB$STATISTICS stores the index selectivity, which for a unique primary key index is roughly 1 / row count, so its reciprocal estimates the number of rows):
SELECT
    R.RDB$RELATION_NAME AS TABLENAME,
    (CASE
         WHEN I.RDB$STATISTICS = 0 THEN 0
         ELSE 1 / I.RDB$STATISTICS
     END) AS COUNTRECORDS8
FROM RDB$RELATIONS R
JOIN RDB$RELATION_CONSTRAINTS C
    ON (R.RDB$RELATION_NAME = C.RDB$RELATION_NAME AND C.RDB$CONSTRAINT_TYPE = 'PRIMARY KEY')
JOIN RDB$INDICES I
    ON (I.RDB$RELATION_NAME = C.RDB$RELATION_NAME AND I.RDB$INDEX_NAME = C.RDB$INDEX_NAME)
To get the number of rows in a table you use the COUNT() function as in any other SQL DB, ie
SELECT count(*) FROM table;
Why not use a query:
select count(distinct field_name)/(count(field_name) + 0.0000) from table_name
The closer the result is to 1, the higher the cardinality of the specified column (adding 0.0000 forces the division to be done in decimal rather than integer arithmetic).