Alternative when the IN clause is given A LOT of values (PostgreSQL)

I'm using the IN clause to retrieve places that contain certain tags. For that I simply use
select .. FROM table WHERE tags IN (...)
For now the number of tags I provide in the IN clause is around 500, but soon the number of tags will probably jump to well over 5000 (maybe even more).
I would guess there is some kind of limitation on both the size of the query AND the number of values in the IN clause (bonus question, out of curiosity: what is this limit?).
So my question is: what is a good alternative query that would be future-proof even if I end up matching against, let's say, 10,000 tags?
PS: I have looked around and seen people mentioning a "temporary table". I have never used those. How would they be used in my case? Will I need to create a temp table every time I make a query?
Thanks,
Francesco

One option is to join the table to a VALUES clause:
with params (tag) as (
values ('tag1'), ('tag2'), ('tag3')
)
select t.*
from the_table t
join params p on p.tag = t.tag;
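As a rough sketch of how an application could build that query for a list of any length, the Python below generates one placeholder per tag and binds the tags as parameters. It uses SQLite from the standard library purely as a stand-in so that it runs without a server; the table layout (id, tag), the sample rows, and the helper names values_join_sql and find_by_tags are assumptions made for illustration. A PostgreSQL driver would use its own placeholder style (psycopg uses %s rather than ?).

```python
import sqlite3


def values_join_sql(n_tags):
    """Build a query joining the_table to a VALUES list of n_tags placeholders."""
    if n_tags < 1:
        raise ValueError("at least one tag is required")
    placeholders = ", ".join(["(?)"] * n_tags)
    return (
        f"WITH params (tag) AS (VALUES {placeholders}) "
        "SELECT t.id, t.tag FROM the_table t "
        "JOIN params p ON p.tag = t.tag ORDER BY t.id"
    )


def find_by_tags(conn, tags):
    # Tags are bound as parameters; only the placeholder count is interpolated.
    return conn.execute(values_join_sql(len(tags)), list(tags)).fetchall()


def demo():
    # Hypothetical sample data, not taken from the question.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE the_table (id INTEGER PRIMARY KEY, tag TEXT)")
    conn.executemany(
        "INSERT INTO the_table (id, tag) VALUES (?, ?)",
        [(1, "tag1"), (2, "tag2"), (3, "other"), (4, "tag3")],
    )
    return find_by_tags(conn, ["tag1", "tag2", "tag3"])
```

Calling demo() returns the three rows whose tag is in the list and skips the row tagged "other".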

You could create a table holding the tags:
tablename
 id | tags
----+------
  1 | tag1
  2 | tag2
  3 | tag3
And then do the following (the subquery must return a single column, so select tags rather than *):
select .. FROM table WHERE tags IN (SELECT tags FROM tablename)

Related

Smart way to filter out unnecessary rows from Query

I have a query that returns a huge number of mutations in Postgres. The data quality is bad and I have "cleaned" it as much as possible.
To make my report as user-friendly as possible I want to filter out some rows that I know the customer doesn't need.
I have the following columns: id, change_type, atr, module, value_old and value_new.
For change_type = 'update' I always want to show every row.
For the rest of the rows I want to build some kind of logic based on the combination of atr and module.
For example, if change_type <> 'update' and the concatenation of atr and module is 'weightperson', then I don't want to show that row.
In this case ids 3 and 11 are worthless and should not be shown.
Is this the best way to solve this, or does anyone have another idea?
select * from t1
where concat(atr,module) not in ('weightperson','floorrentalcontract')
In the end my "not in" part will be filled with over 100 combinations and the query will not look good. Maybe a solution with a CTE would make it look prettier. I'm also concerned about the performance.
CREATE TABLE t1(id integer, change_type text, atr text, module text, value_old text, value_new text) ;
INSERT INTO t1 VALUES
(1,'create','id','person',null ,'9'),
(2,'create','username','person',null ,'abc'),
(3,'create','weight','person',null ,'60'),
(4,'update','id','order','4231' ,'4232'),
(5,'update','filename','document','first.jpg' ,'second.jpg'),
(6,'delete','id','rent','12' ,null),
(7,'delete','cost','rent','600' ,null),
(8,'create','id','rentalcontract',null ,'110'),
(9,'create','tenant','rentalcontract',null ,'Jack'),
(10,'create','rent','rentalcontract',null ,'420'),
(11,'create','floor','rentalcontract',null ,'1');
You could put the list of combinations in a separate table and join with that table, or have them listed directly in a with-clause like this:
with combinations_to_remove as (
select *
from (values
('weight', 'person'),
('floor' ,'rentalcontract')
) as t (atr, module)
)
select t1.*
from t1
left join combinations_to_remove using(atr, module)
where combinations_to_remove.atr is null
or t1.change_type = 'update' -- update rows are always shown
I guess it would be cleaner and easier to maintain if you put them in a separate table!
Read more on with-queries if that sounds strange to you.
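To check the anti-join idea against the sample rows from the question, here is a small sketch that runs it in SQLite through Python's standard library (a stand-in chosen only so that it runs without a PostgreSQL server). It uses an explicit ON condition instead of USING and also keeps every 'update' row, as the question asks; the alias c and the function name visible_ids are illustrative assumptions.

```python
import sqlite3

# Sample rows copied from the question's INSERT statement.
ROWS = [
    (1, "create", "id", "person", None, "9"),
    (2, "create", "username", "person", None, "abc"),
    (3, "create", "weight", "person", None, "60"),
    (4, "update", "id", "order", "4231", "4232"),
    (5, "update", "filename", "document", "first.jpg", "second.jpg"),
    (6, "delete", "id", "rent", "12", None),
    (7, "delete", "cost", "rent", "600", None),
    (8, "create", "id", "rentalcontract", None, "110"),
    (9, "create", "tenant", "rentalcontract", None, "Jack"),
    (10, "create", "rent", "rentalcontract", None, "420"),
    (11, "create", "floor", "rentalcontract", None, "1"),
]

ANTI_JOIN = """
WITH combinations_to_remove (atr, module) AS (
    VALUES ('weight', 'person'), ('floor', 'rentalcontract')
)
SELECT t1.id
FROM t1
LEFT JOIN combinations_to_remove c
       ON c.atr = t1.atr AND c.module = t1.module
WHERE c.atr IS NULL              -- no matching combination: keep the row
   OR t1.change_type = 'update'  -- updates are always shown
ORDER BY t1.id
"""


def visible_ids():
    """Load the sample rows and return the ids that survive the filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t1 (id INTEGER, change_type TEXT, atr TEXT, "
        "module TEXT, value_old TEXT, value_new TEXT)"
    )
    conn.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    return [row[0] for row in conn.execute(ANTI_JOIN)]
```

With the two combinations listed, visible_ids() leaves out ids 3 and 11, which matches the rows the question calls worthless.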

Optimizing a query with multiple IN

I have a query like this:
SELECT * FROM table
WHERE department='param1' AND type='param2' AND product='param3'
AND product_code IN (10-30 alphanumerics) AND unit_code IN (10+ numerics)
AND first_name || last_name IN (10-20 names)
AND sale_id LIKE ANY(list of regex string)
Runtime was too high so I was asked to optimize it.
The list of parameters varies for the code columns for different users.
Each user provides their list of codes and then loops over product.
product used to be an IN clause list as well but it was split up.
Things I tried
By adding an index on (department, type, product) I was able to get a 4x improvement.
Currently some values of product take only 2-3 seconds, while others take 30 seconds.
I tried creating a pre-concatenated column of first_name || last_name, but the runtime improvement was too small to be worth it.
Is there some way I can improve the performance of the other clauses, such as the "IN" clauses or the LIKE ANY clause?
In my experience, replacing large IN lists with a JOIN to a VALUES clause often improves performance.
So instead of:
SELECT *
FROM table
WHERE department='param1'
AND type='param2'
AND product='param3'
AND product_code IN (10-30 alphanumerics)
Use:
SELECT *
FROM table t
JOIN ( values ('code1'),('code2'),('code3') ) as x(code) on x.code = t.product_code
WHERE department='param1'
AND type='param2'
AND product='param3'
But you have to make sure you don't have any duplicates in the values () list, because every duplicate repeats the matching rows in the join result.
The concatenation is also wrong, because comparing the concatenated value is not the same as comparing each value individually, e.g. ('alexander', 'son') would be treated as identical to ('alex', 'anderson').
You should use:
and (first_name, last_name) in ( ('fname1', 'lname1'), ('fname2', 'lname2'))
This can also be written as a join
SELECT *
FROM table t
JOIN ( values ('code1'),('code2'),('code3') ) as x(code) on x.code = t.product_code
JOIN (
values ('fname1', 'lname1'), ('fname2', 'lname2')
) as n(fname, lname) on (n.fname, n.lname) = (t.first_name, t.last_name)
WHERE department='param1'
AND type='param2'
AND product='param3'
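The warning above about duplicates in the values () list can be seen in a small experiment. The sketch below uses SQLite from Python's standard library as a stand-in for PostgreSQL, with a hypothetical sales table and made-up codes; it shows that a repeated code repeats the joined row, and that removing duplicates on the client side before building the list avoids it.

```python
import sqlite3


def matching_ids(conn, codes):
    """Return ids of sales rows whose product_code joins to the VALUES list."""
    placeholders = ", ".join(["(?)"] * len(codes))
    sql = (
        f"WITH x (code) AS (VALUES {placeholders}) "
        "SELECT t.id FROM sales t JOIN x ON x.code = t.product_code "
        "ORDER BY t.id"
    )
    return [row[0] for row in conn.execute(sql, list(codes))]


def demo():
    # Hypothetical table and data, used only to illustrate the effect.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product_code TEXT)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)", [(1, "A1"), (2, "B2"), (3, "C3")]
    )
    codes = ["A1", "A1", "B2"]                 # "A1" appears twice
    with_duplicates = matching_ids(conn, codes)
    unique_codes = list(dict.fromkeys(codes))  # order-preserving de-duplication
    without_duplicates = matching_ids(conn, unique_codes)
    return with_duplicates, without_duplicates
```

Here demo() returns the id 1 twice for the list with the repeated code and once after de-duplication, whereas an IN list would have returned it once in both cases.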
You generally don't have to do anything special for an index to be used with multiple IN-lists, other than keeping the table well vacuumed and analyzed. A btree index on (department, type, product, product_code, unit_code, (first_name || last_name)) should work well. If it doesn't, please show an EXPLAIN (ANALYZE, BUFFERS) for it, preferably with track_io_timing turned on. If the selectivities of your conditions are not mostly independent of each other, that might lead to planning problems.

Using a list as replacement for singular patterns in regexp_replace

I have a table that I need to delete random words/characters out of. To do this, I have been using a regexp_replace function with the addition of multiple patterns. An example is below:
select regexp_replace(combined,'\y(NAME|001|CONTAINERS:|MT|COUNT|PCE|KG|PACKAGE)\y','', 'g')
as description, id from export_final;
However, in the full list, there are around 70 different patterns that I remove from the description. As you can imagine, the code is very cluttered. This leads me to my question: is there a way to put these patterns into another table and then use that table to check the descriptions?
Of course. Populate your desired 'other' table with what patterns you need. Then create a CTE that uses string_agg function to build the regex. Example:
create table exclude_list( pattern_word text);
insert into exclude_list(pattern_word)
values('NAME'),('001'),('CONTAINERS:'),('MT'),('COUNT'),('PCE'),('KG'),('PACKAGE');
with exclude as
( select '\y(' || string_agg(pattern_word,'|') || ')\y' regex from exclude_list )
-- CTE simulates actual table to provide test data
, export_final (id,combined) as (values (0,'This row 001 NAME Main PACKAGE has COUNT 3 units'),(1,'But single package can hold 6 KG'))
select regexp_replace(combined,regex,'', 'g')
as description, id
from export_final cross join exclude;
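The same idea (build one alternation from a list and wrap it in word boundaries) can be tried outside the database. This is a minimal Python sketch, not the answer's method: Python's \b plays the role of PostgreSQL's \y, re.escape is added so that a word containing regex metacharacters is matched literally, and the final whitespace clean-up is an extra step that the SQL above does not perform. The names build_exclude_regex and strip_words are made up for the example.

```python
import re


def build_exclude_regex(words):
    """Join the words into one alternation wrapped in word boundaries."""
    # re.escape keeps characters such as "." or "+" in a word literal.
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + alternation + r")\b")


def strip_words(text, regex):
    # Remove the matches, then collapse the leftover runs of whitespace.
    return " ".join(regex.sub("", text).split())


# Same word list as the exclude_list table above.
EXCLUDE = ["NAME", "001", "CONTAINERS:", "MT", "COUNT", "PCE", "KG", "PACKAGE"]
REGEX = build_exclude_regex(EXCLUDE)
```

Matching is case-sensitive here, as in the SQL version, so the lowercase "package" in the second test row is left in place.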

Keyword search using PostgreSQL

I am trying to identify observations in my data using a list of keywords. However, the search results contain observations where only part of a word matches the keyword. For instance, the keyword ice returns varices. I am using the following code:
select *
from mytab
WHERE myvar similar to '%((ice)|(cool))%';
I tried to_tsquery, which does the exact match and does not include observations with varices. But this approach takes significantly longer to run (a 2-keyword search with similar to '% %' takes 5 secs, whereas to_tsquery takes 30 secs for a 1-keyword search, and I have more than 900 keywords to search).
select *
from mytab
where myvar @@ to_tsquery('ice');
Is there a way to query multiple keywords using to_tsquery, and is there any way to speed up the query?
I'd suggest using keywords in a relational sense rather than having a running list of them under one field, which makes for terrible performance. Instead, you can have a table of keywords with ids as primary keys and foreign keys referring to mytab's primary keys. So you'd end up with the following:
keywords table
 id | mytab_id | keyword
----+----------+---------
  1 |        1 | liver
  2 |        1 | disease
  3 |        1 | varices
  4 |        2 | ice
mytab table
 id | rest of fields
----+----------------
  1 | ....
  2 | ....
You can then do an inner join to find what keywords belong to the specified entries in mytab:
SELECT * FROM mytab
JOIN keywords ON keywords.mytab_id = mytab.id
WHERE keyword = 'ice'
You could also add a constraint to make sure the keyword and mytab_id pair is unique, that way you don't accidentally end up with the same keyword for the same entry in mytab.
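A minimal sketch of that layout, run in SQLite through Python's standard library so it needs no server (PostgreSQL DDL would look very similar). The index name, the sample text in myvar, and the helper entries_with_any_keyword are assumptions for illustration. It also shows how several keywords can be matched in one query with exact, whole-keyword comparison.

```python
import sqlite3


def build_demo_db():
    """Create the two tables from the answer with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mytab (id INTEGER PRIMARY KEY, myvar TEXT);
        CREATE TABLE keywords (
            id INTEGER PRIMARY KEY,
            mytab_id INTEGER NOT NULL REFERENCES mytab (id),
            keyword TEXT NOT NULL,
            UNIQUE (mytab_id, keyword)   -- one row per keyword per entry
        );
        CREATE INDEX keywords_keyword_idx ON keywords (keyword);
        INSERT INTO mytab VALUES (1, 'liver disease varices'), (2, 'ice');
        INSERT INTO keywords (mytab_id, keyword) VALUES
            (1, 'liver'), (1, 'disease'), (1, 'varices'), (2, 'ice');
        """
    )
    return conn


def entries_with_any_keyword(conn, wanted):
    """Return mytab ids that have at least one of the wanted keywords."""
    placeholders = ", ".join(["?"] * len(wanted))
    sql = (
        "SELECT DISTINCT mytab.id FROM mytab "
        "JOIN keywords ON keywords.mytab_id = mytab.id "
        f"WHERE keywords.keyword IN ({placeholders}) ORDER BY mytab.id"
    )
    return [row[0] for row in conn.execute(sql, list(wanted))]
```

Because keywords are compared as whole values, searching for 'ice' finds entry 2 only and does not pick up 'varices' from entry 1.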

SqlAlchemy: count of distinct over multiple columns

I can't do:
>>> session.query(
func.count(distinct(Hit.ip_address, Hit.user_agent))).first()
TypeError: distinct() takes exactly 1 argument (2 given)
I can do:
session.query(
func.count(distinct(func.concat(Hit.ip_address, Hit.user_agent)))).first()
Which is fine for my case (a count of unique users in a 'pageload' db table).
However, it isn't correct in the general case, e.g. it will give a count of 1 instead of 2 for the following table:
 col_a | col_b
-------+-------
 xx    | yy
 xxy   | y
Is there any way to generate the following SQL (which is valid in postgresql at least)?
SELECT count(distinct (col_a, col_b)) FROM my_table;
distinct() accepts more than one argument when appended to the query object:
session.query(Hit).distinct(Hit.ip_address, Hit.user_agent).count()
It should generate something like:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT ON (hit.ip_address, hit.user_agent)
hit.ip_address AS hit_ip_address, hit.user_agent AS hit_user_agent
FROM hit) AS anon_1
which is even a bit closer to what you wanted.
The exact query can be produced using the tuple_() construct:
session.query(
func.count(distinct(tuple_(Hit.ip_address, Hit.user_agent)))).scalar()
It looks like SQLAlchemy's distinct() accepts only one column or expression.
Another way around this is to use group_by and count. This should be more efficient than concatenating the two columns, because with GROUP BY the database can use indexes on those columns if they exist:
session.query(Hit.ip_address, Hit.user_agent).\
group_by(Hit.ip_address, Hit.user_agent).count()
Generated query would still look different from what you asked about:
SELECT count(*) AS count_1
FROM (SELECT hittable.user_agent AS hittable_user_agent, hittable.ip_address AS hittable_ip_address
FROM hittable GROUP BY hittable.user_agent, hittable.ip_address) AS anon_1
You can add a separator character in the concat function to keep the combined values distinct (as long as the separator cannot occur in the values themselves). Taking your example as a reference, it would be:
session.query(
func.count(distinct(func.concat(Hit.ip_address, "-", Hit.user_agent)))).first()
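To see what the separator does and does not fix, the plain-Python sketch below imitates the two counting strategies on in-memory rows; no database or SQLAlchemy is involved. The function names and the TRICKY rows are hypothetical. It suggests that a separator resolves the xx/yy example from the question but can still collide when the separator itself occurs in the data, which is why counting real pairs (the tuple_() approach) is the safer choice.

```python
def count_distinct_concat(rows, sep=""):
    """Count distinct values of col_a + sep + col_b (the concat approach)."""
    return len({a + sep + b for a, b in rows})


def count_distinct_pairs(rows):
    """Count distinct (col_a, col_b) pairs, like count(distinct (col_a, col_b))."""
    return len({(a, b) for a, b in rows})


TABLE = [("xx", "yy"), ("xxy", "y")]   # the example table from the question
TRICKY = [("a-", "b"), ("a", "-b")]    # hypothetical values containing "-"
```

On TABLE, plain concatenation sees a single value while the "-" separator and the pair count both see two; on TRICKY, even the separator version merges the two rows into one.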