Matching patterns between multiple columns - PostgreSQL

I have two columns, say Main and Sub (they may or may not be in the same table).
Main is a varchar of length 20 and Sub is a varchar of length 8.
Sub is always a substring of Main: specifically, the last 8 characters of Main.
I could successfully design a query that matches the pattern exactly using substr("MainColumn", 13, 8).
Query:
select * from "MainTable"
where substr("MainColumn",13,8) LIKE (
select "SubColumn" From "SubTable" Where "SubId"=1043);
but I want to use LIKE with %, _, etc. in my query so that I can loosely match the pattern (i.e. not all 8 characters).
The question is: how can I do that?
I know that the query below is completely wrong, but I want to achieve something like this:
Select * from "MainTable"
Where "MainColumn" Like '%' Select "SubColumn" From "SubTable" Where "SubId"=2'

The answers so far fail to address your question:
but I want to use LIKE with %, _, etc. in my query so that I can loosely match the pattern (i.e. not all 8 characters).
It makes hardly any difference whether you use LIKE or = as long as you match the whole string (and there are no wildcard characters in your string). To make the search fuzzy, you need to replace part of the pattern, not just add to it.
For instance, to match on the last 7 (instead of 8) characters of subcolumn:
SELECT *
FROM maintable m
WHERE left(maincolumn, 8) LIKE
( '%' || left((SELECT subcolumn FROM subtable WHERE subid = 2), 7));
I use the simpler left() (introduced with Postgres 9.1).
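For reference, left() is just shorthand for an equivalent substr() call, so on pre-9.1 versions you could write:
substr(maincolumn, 1, 8)  -- equivalent to left(maincolumn, 8)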
You could simplify this to:
SELECT *
FROM maintable m
WHERE left(maincolumn, 7) =
(SELECT left(subcolumn,7) FROM subtable WHERE subid = 2);
But you wouldn't if you use the special index I mention further down, because expressions in functional indexes have to be matched precisely to be of use.
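To illustrate with a hypothetical index (not part of the original answer):
-- hypothetical expression index, for illustration only
CREATE INDEX maintable_left8_idx ON maintable (left(maincolumn, 8));
-- usable for:     WHERE left(maincolumn, 8) = 'abcdefgh'
-- not usable for: WHERE left(maincolumn, 7) = 'abcdefg'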
You may be interested in the extension pg_trgm.
In PostgreSQL 9.1 run once per database:
CREATE EXTENSION pg_trgm;
Two reasons:
It supplies the similarity operator %. With it you can build a smart similarity search:
--SELECT show_limit();
SELECT set_limit(0.5); -- adjust similarity limit for % operator
SELECT *
FROM maintable m
WHERE left(maincolumn, 8) %
(SELECT subcolumn FROM subtable WHERE subid = 2);
It supplies index support for both LIKE and %
If read performance is more important than write performance, I suggest you create a functional GIN or GiST index like this:
CREATE INDEX maintable_maincol_tgrm_idx ON maintable
USING gist (left(maincolumn, 8) gist_trgm_ops);
This index supports either query. Be aware that it comes with some cost for write operations.
A quick benchmark for a similar case can be found in this related answer.

Try
SELECT t1.*
FROM "MainTable" AS t1, "SubTable" AS t2
WHERE t2."SubId" = 1043
AND substr(t1."MainColumn", 13, 8) LIKE '%' || CAST(t2."SubColumn" AS text);

The argument to LIKE is an ordinary string, so all string manipulations are valid here.
In your case you need to concatenate the wildcards with the target substring, as @bksi suggests:
... LIKE '%'||CAST("SubColumn" AS text) ...
Note, though, that such patterns (ones starting with a % wildcard) perform badly. Take a look at PostgreSQL LIKE query performance variations.
I would recommend:
sticking with the current substr("MainColumn", 13, 8) approach;
avoiding LIKE in favor of equality comparison (=): the two behave the same when the LIKE pattern contains no wildcards, but equality is easier to read;
building an expression index on "MainTable" the following way:
CREATE INDEX i_maincolumn ON "MainTable" (substr("MainColumn", 13, 8));
This combination will perform better in my view.
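For example, the equality form of the original query then matches the indexed expression exactly and can use the index:
SELECT *
FROM "MainTable"
WHERE substr("MainColumn", 13, 8) =
      (SELECT "SubColumn" FROM "SubTable" WHERE "SubId" = 1043);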
And use lowercase names for the tables/columns, so that you can avoid double-quoting them.

Related

Matching performance with pattern from table column

I have a query which looks like:
SELECT *
FROM my_table
WHERE 'some_string' LIKE my_table.some_column || '%%'
How can I index some_column to improve the performance of this query?
Or is there a better way to filter this?
This predicate effectively searches for all prefixes of a given string:
WHERE 'some_string' LIKE my_table.some_column || '%'
Maybe % is a special character in your client which needs to be escaped with another % to pass a literal %; otherwise the '%%' is just noise and can be replaced with '%'.
The most efficient solution should be a recursive CTE (or similar) that matches every prefix exactly, starting with some_column = left('some_string', 1), up to some_column = left('some_string', length('some_string')) (= 'some_string').
You only need a plain btree index on the column for this. Depending on details of your implementation, partial expression indexes might improve performance ...
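A minimal sketch of that recursive CTE, assuming the table and column names from the question:
WITH RECURSIVE prefixes AS (
   SELECT left('some_string', 1) AS prefix, 1 AS len
   UNION ALL
   SELECT left('some_string', len + 1), len + 1
   FROM   prefixes
   WHERE  len < length('some_string')
)
SELECT m.*
FROM   prefixes p
JOIN   my_table m ON m.some_column = p.prefix;  -- each equality lookup can use a plain btree index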
Related:
Reverse pattern matching: find the longest prefix
Algorithm for finding the longest prefix
PostgreSQL LIKE query performance variations
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
I believe you intend to write the following query:
SELECT *
FROM my_table
WHERE my_table.some_column LIKE 'some_string%';
In other words, you want to find records where some column begins with some_string followed by anything, possibly nothing at all.
As far as I know, a regular B-tree index on some_column will be effective, to a point, in your query. The reason is that Postgres can traverse the tree looking for the prefix some_string. Once it has found that entry, beyond that the index might not help. But an index on some_column should give you some performance benefit here.
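One caveat worth adding (not from the original answer): unless the column's collation is C, a plain btree index is not used for LIKE prefix patterns; you would typically create it with the text_pattern_ops operator class instead:
CREATE INDEX my_table_some_column_pattern_idx
ON my_table (some_column text_pattern_ops);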
A condition where an index would not help would be the following:
WHERE my_table.some_column LIKE '%some_string';
In this case, the index is rendered mostly useless, because we have no idea with what prefix the column value should begin.

Postgresql: how to set a weight for tsquery

How to set a weight for tsquery? I need to set a weight for tsquery obtained from plainto_tsquery.
Is it possible? Something like setweight(plainto_tsquery(''), 'A'), but setweight works only on tsvector.
I have this problem too. My use case is large documents, many sections, and I wish to provide an option for "search heading text only". (Headings have weight A and are scattered throughout the document; other sections have weight B, C or D depending upon where they occur.)
Here are two solutions that should help.
Solution 1: setweight function for tsquery
The function converts the tsquery to text, applies a regular expression to set the weights, then converts back to tsquery.
CREATE FUNCTION setweight(query tsquery, weights text) RETURNS tsquery AS $$
  SELECT regexp_replace(
           query::text,
           '(?<=[^ !])'':?(\*?)A?B?C?D?', ''':\1'||weights,
           'g'
         )::tsquery;
$$ LANGUAGE SQL IMMUTABLE;
Example:
select setweight( plainto_tsquery('fat cats and rats'), 'A' );
-- 'fat':A & 'cat':A & 'rat':A
select setweight( phraseto_tsquery('fat cats and rats'), 'A' );
-- 'fat':A <-> 'cat':A <2> 'rat':A
select setweight( to_tsquery('fat & (cat:A & rat) & !dog:*CD'), 'BC' );
-- 'fat':BC & 'cat':BC & 'rat':BC & !'dog':*BC
Solution 2: Functional index based on filtered tsvector
First create additional indexes on the fulltext column you'll be searching on.
e.g.
CREATE INDEX fulltext_idx
ON your_table USING gin (fulltext);

CREATE INDEX fulltext_idx_A
ON your_table USING gin (ts_filter(fulltext, '{a}'));

CREATE INDEX fulltext_idx_AB
ON your_table USING gin (ts_filter(fulltext, '{a,b}'));
For whatever combination of weights you need.
Then, when searching, use the filtered expression, e.g.:
SELECT *
FROM your_table
WHERE ts_filter(fulltext, '{a}') @@ plainto_tsquery('your query');
The search should take place on the indexed expression.
Discussion
Solution 1 gives you the function you're looking for, but the problem with weighted queries is that although postgres will use the index to find candidate matches, it still needs to pull back each document to check the weights.
In my case, when searching by titles only, Solution 2 appears to give better performance. The text within titles (weight A) uses a much smaller vocabulary than the whole document, so fulltext_idx_A is considerably smaller than fulltext_idx and the results don't need to be rechecked after matching.
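To check whether the filtered index is used and whether rows get rechecked, explain analyse helps (names assumed as above):
EXPLAIN ANALYSE
SELECT *
FROM your_table
WHERE ts_filter(fulltext, '{a}') @@ plainto_tsquery('your query');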
For your own case, performance will depend entirely on your own document structure and the nature of your queries, so test with explain analyse to select the better solution. Given the age of the ticket, mind you, I assume you've solved this one already :-)
Note: ts_filter() and phraseto_tsquery() are from Postgres 9.6.
Here is a good article about Postgres full-text search:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
You can also set a weight while building the tsvector:
setweight(to_tsvector(coalesce($columnName, '')), '$weight')
where $columnName is a column reference such as users.name (table.column), and $weight is the weight you want, e.g. A, B or C.
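A concrete sketch of that pattern, combining two weighted columns into one document vector (users.bio is a made-up column for illustration):
SELECT setweight(to_tsvector(coalesce(users.name, '')), 'A') ||
       setweight(to_tsvector(coalesce(users.bio,  '')), 'B') AS document
FROM users;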

PostgreSQL query on a text column ignoring special characters

I have a table which contains a text column, say vehicle number.
Now I want to query the table for rows that contain a particular vehicle number.
While matching I do not want to consider non-alphanumeric characters.
example: query condition - DEL123
should match - DEL-123, DEL/123, DEL#123, etc...
If you know which characters to skip, put them as the second parameter of this translate() call (which is faster than regexp functions):
select *
from a_table
where translate(code, '-/#', '') = 'DEL123';
Else, you can compare only alphanumeric characters using regexp_replace():
select *
from a_table
where regexp_replace(code, '[^[:alnum:]]', '', 'g') = 'DEL123';
@klin's answer is great, but it is not sargable, so in cases where you're searching through millions of records (maybe not your case, but perhaps someone else with a similar question is looking for answers), an anchored regular expression will likely yield much better results.
The following can use an index on code, significantly reducing the number of rows tested:
select *
from a_table
where code ~ '^DEL[^[:alnum:]]*123$';
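If this search has to be fast over many rows, one further option (not from the original answers) is an expression index that matches @klin's normalized expression exactly, which makes the equality query indexable:
CREATE INDEX a_table_code_norm_idx
ON a_table (regexp_replace(code, '[^[:alnum:]]', '', 'g'));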

Comparing data in a column of one table with the same column in another table

I have two tables, temp and md. Both have a field called uri_stem, and there are certain values I want to omit from temp but not from md. I need a comparison that can match certain patterns and remove them from temp when similar patterns exist in md.
Right now I am using the code below to remove data matching the patterns I want to omit, but I want a method that compares against the patterns stored in the md table rather than hardcoding each one. Hope the explanation is clear enough.
FROM spfmtr01.tbl_1c_apps_log_temp
WHERE uri_stem NOT LIKE '%.js'
  AND uri_stem NOT LIKE '%.css'
  AND uri_stem NOT LIKE '%.gif'
  AND uri_stem NOT LIKE '%.png'
  AND uri_stem NOT LIKE '%.html'
  AND uri_stem NOT LIKE '%.jpg'
  AND uri_stem NOT LIKE '%.jpeg'
  AND uri_stem NOT LIKE '%.ico'
  AND uri_stem NOT LIKE '%.htm'
  AND uri_stem NOT LIKE '%.pdf'
  AND uri_stem NOT LIKE '%.Png'
  AND uri_stem NOT LIKE '%.PNG'
This example is based on the answer I mentioned in my comment.
SQLFiddle
Sample data:
drop table if exists a, b;
create table a (testedstr varchar);
create table b (condstr varchar);
insert into a values
('aa.aa.jpg'),
('aa.aa.bjpg'), -- no match
('aa.aa.jxpg'), -- no match
('aa.aa.jPg'),
('aa.aa.aico'), -- no match
('aa.aa.ico'),
('bb.cc.dd.icox'), -- no match
('bb.cc.dd.cco'); -- no match
insert into b values ('jpg'), ('ico');
Explanation:
in table a we have the strings we would like to test (stored in column testedstr)
in table b we have the strings we would like to use as test expressions (stored in column condstr)
SQL:
with cte as (select '\.(' || string_agg(condstr,'|') || ')$' condstr from b)
select * from a, cte where testedstr !~* condstr;
Explanation:
in the first line we aggregate all the patterns we want to test into one string; the result is jpg|ico (aggregated into a single row).
in the second line we cross join the tested table with our testing expression (from the first line) and use a regular expression to perform the test.
the complete regular expression looks like \.(jpg|ico)$
For older versions, you can use the answer provided by @Bohemian below. For my sample data (adjusted for multiple possible dots) it would look like this (SQLFiddle):
select *
from a
where lower(reverse(split_part(reverse(testedstr), '.', 1)))
      not in (select lower(condstr) from b);
Without reverse function (SQLFiddle):
select *,
       lower(split_part(testedstr, '.', length(testedstr) - length(replace(testedstr, '.', '')) + 1)) as extension
from a
where lower(split_part(testedstr, '.', length(testedstr) - length(replace(testedstr, '.', '')) + 1))
      not in (select lower(condstr) from b);
First let's refactor the many conditions into just one:
where lower(substring(uri_stem from '[^.]+$')) not in ('js', 'css', 'gif', 'png', 'html', 'jpg', 'jpeg', 'ico', 'htm', 'pdf')
In this form, it's easy to see how the list of values can be selected instead of coded:
where lower(substring(uri_stem from '[^.]+$')) not in (
select lower(somecolumn) from sometable)
Note the use of lower() to avoid problems of dealing with variants of case.
You could also code it as a join:
select t1.*
from mytable t1
left join sometable t2
  on lower(t2.somecolumn) = lower(substring(t1.uri_stem from '[^.]+$'))
where t2.somecolumn is null -- filter out matches

Searching individual words in a string

I know about full-text search, but that only matches your query against whole words. I want to select strings that contain a word that starts with my query. For example, if I search:
appl
the following should match:
a really nice application
apples are cool
appliances
since all those strings contains words that start with appl. In addition, it would be nice if I could select the number of words that match, and sort based on that.
How can I implement this in PostgreSQL?
Prefix matching with Full Text Search
FTS supports prefix matching. Your query works like this:
SELECT * FROM tbl
WHERE to_tsvector('simple', string) @@ to_tsquery('simple', 'appl:*');
Note the appended :* in the tsquery. This can use an index.
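The supporting index (a sketch, assuming the table layout above) is a GIN index on the same tsvector expression:
CREATE INDEX tbl_string_fts_idx
ON tbl USING gin (to_tsvector('simple', string));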
See:
Get partial match from GIN indexed TSVECTOR column
Alternative with regular expressions
SELECT * FROM tbl
WHERE string ~ '\mappl';
Quoting the manual here:
\m .. matches only at the beginning of a word
To order by the count of matches, you could use regexp_matches()
SELECT tbl_id, count(*) AS matches
FROM (
SELECT tbl_id, regexp_matches(string, '\mappl', 'g')
FROM tbl
WHERE string ~ '\mappl'
) sub
GROUP BY tbl_id
ORDER BY matches DESC;
Or regexp_split_to_table():
SELECT tbl_id, string, count(*) - 1 AS matches
FROM (
SELECT tbl_id, string, regexp_split_to_table(string, '\mappl')
FROM tbl
WHERE string ~ '\mappl'
) sub
GROUP BY 1, 2
ORDER BY 3 DESC, 2, 1;
db<>fiddle here
Old sqlfiddle
Postgres 9.3 or later has index support for simple regular expressions with a trigram GIN or GiST index. The release notes for Postgres 9.3:
Add support for indexing of regular-expression searches in pg_trgm
(Alexander Korotkov)
See:
PostgreSQL LIKE query performance variations
Depesz wrote a blog about index support for regular expressions.
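Such a trigram index (a sketch for the table above) would look like:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX tbl_string_trgm_idx ON tbl USING gin (string gin_trgm_ops);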
SELECT * FROM some_table WHERE some_field LIKE 'appl%' OR some_field LIKE '% appl%';
As for counting the number of words that match, I believe that would be too expensive to do dynamically in Postgres (though maybe someone else knows better). One way you could do it is by writing a function that counts occurrences in a string and then adding ORDER BY myFunction('appl', some_field). Again, though, this method is very expensive (i.e. slow) and not recommended.
For things like that, you should probably use a separate/complementary full-text search engine like Sphinx Search (google it), which is specialized for that sort of thing.
An alternative is to have another table that contains keywords and the number of occurrences of those keywords in each string. This means you need to store each phrase you have (e.g. really really nice application) and also store its keywords in another table (i.e. really, 2; nice, 1; application, 1), and link that keyword table to your full-phrase table. It also means you have to break strings up into keywords as they are entered into your database and store them in two places. This is a typical space vs. speed trade-off.
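A minimal sketch of that layout (all names invented for illustration):
CREATE TABLE phrase (
    phrase_id serial PRIMARY KEY,
    phrase    text NOT NULL            -- e.g. 'really really nice application'
);

CREATE TABLE phrase_keyword (
    phrase_id   int  NOT NULL REFERENCES phrase,
    keyword     text NOT NULL,         -- e.g. 'really'
    occurrences int  NOT NULL,         -- e.g. 2
    PRIMARY KEY (phrase_id, keyword)
);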