How to optimize inverse pattern matching in PostgreSQL?

I have Pg version 13.
CREATE TABLE test_schemes (
    pattern TEXT NOT NULL,
    some_code TEXT NOT NULL
);
Example data

 pattern   | some_code
-----------+-----------
 __3_      | c1
 __34      | c2
 1_3_      | a12
 _7__      | a10
 7138      | a19
 _123|123_ | a20
 ___253    | a28
 253       | a29
 2_1       | a30
This table has about 300k rows. I want to optimize a simple query like
SELECT * FROM test_schemes WHERE '1234' SIMILAR TO pattern;

 pattern   | some_code
-----------+-----------
 __3_      | c1
 __34      | c2
 1_3_      | a12
 _123|123_ | a20
The problem is that this simple query does a full scan of all 300k rows to find the matches. Given this design, how can I make the query faster (e.g. with some special index)?

Internally, SIMILAR TO works much like a regular expression, as running EXPLAIN on the query makes evident. You may want to switch to regexes outright, but it is also worth looking at text_pattern_ops indexes to see whether you can improve performance.
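For illustration, you can see the regex translation and the full scan with EXPLAIN. This is a sketch against the table from the question; the exact plan text varies by version and statistics:

```sql
EXPLAIN SELECT * FROM test_schemes WHERE '1234' SIMILAR TO pattern;
-- Expect something along the lines of:
--   Seq Scan on test_schemes
--     Filter: ('1234'::text ~ similar_to_escape(pattern))
-- i.e. every row is visited, because the pattern is on the
-- right-hand side and no index can serve this predicate.
```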

If the pipe is the only SIMILAR TO feature (beyond those present in LIKE) that you use, then you can rewrite the pattern into a form that works with the much faster LIKE.
SELECT * FROM test_schemes WHERE '1234' LIKE any(string_to_array(pattern, '|'));
In my hands this is about 25 times faster, and it gives the same answer as your example on your example data (augmented with a few hundred thousand rows of garbage to bring the table up to roughly the row count you indicated). It does assume that no pipes are escaped.
If you store the data already broken apart, it is about 3 times faster still, though of course it gives cosmetically different answers.
create table test_schemes2 as
  select unnest as pattern, some_code
  from test_schemes, unnest(string_to_array(pattern, '|'));
SELECT * FROM test_schemes2 where '1234' LIKE pattern;
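To sanity-check that the pre-split table returns the same matches as the original pipe patterns, you can compare the two queries with EXCEPT. A sketch, assuming the split-out table stores the code in a some_code column as above:

```sql
-- Codes matched via the original pipe patterns...
SELECT some_code FROM test_schemes
WHERE '1234' LIKE any(string_to_array(pattern, '|'))
EXCEPT
-- ...minus codes matched via the pre-split table.
SELECT some_code FROM test_schemes2
WHERE '1234' LIKE pattern;
-- An empty result means both approaches agree for this input.
```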

Related

Postgres Full Text Search - Find Other Similar Documents

I am looking for a way to use Postgres (version 9.6+) Full Text Search to find other documents similar to an input document, essentially producing results similar to Elasticsearch's more_like_this query. As far as I can tell, Postgres offers no way to compare ts_vectors to each other.
I've tried various techniques, like converting the source document back into a ts_query or reprocessing the original doc, but that requires too much overhead.
Would greatly appreciate any advice - thanks!
It looks like the only option is to use pg_trgm instead of Postgres's built-in full text search. Here is how I ended up implementing this:
Using a simple table (or, in this case, a materialized view) that holds the primary key of the post and the full text body in two columns:
Materialized view "public.text_index"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
id | integer | | | | plain | |
post | text | | | | extended | |
View definition:
SELECT posts.id,
       posts.body AS text
FROM posts
ORDER BY posts.publication_date DESC;
Then, using a lateral join, we can match rows and order them by similarity to find posts that are "close to" or "related to" any other post:
select *
from text_index tx
left join lateral (
  select similarity(tx.text, t.text)
  from text_index t
  where t.id = 12345
) s on true
order by similarity desc
limit 10;
This is of course a naive way to match documents and may require further tuning. Additionally, a trgm GIN index on the text column will speed up the searches significantly.
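The index mentioned above can be created roughly like this. A sketch: it assumes the pg_trgm extension is installable and that text_index is a materialized view (which can be indexed like a table), with the body exposed as the text column per the view definition above:

```sql
-- Make trigram operators and operator classes available.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN index over trigrams of the column compared by similarity().
CREATE INDEX text_index_text_trgm_idx
    ON text_index USING gin (text gin_trgm_ops);
```

Note that a GIN trigram index accelerates the % similarity operator; if you want index-assisted ORDER BY on nearest matches via the <-> distance operator, a GiST trigram index is the variant that supports it.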

Postgresql: More efficient way of joining tables based on multiple address fields

I have a table that lists two connected values, ID and TaxNumber (TIN), that looks somewhat like this:
IDTINMap
 ID         | TIN
------------+--------
 1234567890 | 654321
 3456321467 | 986321
 8764932312 | 245234
An ID can map to multiple TINs, and a TIN might map to multiple IDs, but there is a Unique constraint on the table for an ID, TIN pair.
This list isn't complete, and the table has about 8000 rows. I have another table, IDListing that contains metadata for about 9 million IDs including name, address, city, state, postalcode, and the ID.
What I'm trying to do is build an expanded ID - TIN map. Currently I'm doing this by first joining the IDTINMap table with IDListing on the ID field, which gives something that looks like this in a CTE that I'll call Step1 right now:
 ID         | TIN    | Name      | Address        | City          | State | Zip
------------+--------+-----------+----------------+---------------+-------+-------
 1234567890 | 654321 | John Doe  | 123 Easy St    | Seattle       | WA    | 65432
 3456321467 | 986321 | Tyler Toe | 874 W 84th Ave | New York      | NY    | 48392
 8764932312 | 245234 | Jane Poe  | 984 Oak Street | San Francisco | CA    | 12345
Then I join the IDListing table again, matching Step1 on name, address, city, state, and zip all being equal. I know I could do something more complicated, like fuzzy matching, but for right now we're just looking at exact matches. In the join I keep the ID from Step1 as 'ReferenceID', keep the TIN, and add another column with all the matching IDs. I don't keep any of the address/city/state/zip info, just the three numbers.
Then I can go back and insert all the distinct pairs into a final table.
I've tried this with a query, and it works and gives me the desired result. However, the query is slower than desired. I'm used to joining on columns that I've indexed (like ID or TIN), but it's slow to join on all of the address fields. Is there a good way to improve this? Joining on each field individually is faster than joining on a CONCAT() of all the fields (I have tried this). I'm just wondering if there is another way to optimize it.
Make the final result a materialized view. Refresh it when you need to update the data (every night? every three hours?). Then use this view for your normal operations.
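A minimal sketch of that suggestion. The table and column names (idtinmap, idlisting, postalcode, and the view name) are assumptions based on the question, not a definitive schema:

```sql
-- Precompute the expensive multi-column self-join once.
CREATE MATERIALIZED VIEW expanded_id_tin_map AS
SELECT DISTINCT
       m.id  AS reference_id,
       m.tin,
       l2.id AS matched_id
FROM idtinmap m
JOIN idlisting l1 ON l1.id = m.id
JOIN idlisting l2 ON l2.name       = l1.name
                 AND l2.address    = l1.address
                 AND l2.city       = l1.city
                 AND l2.state      = l1.state
                 AND l2.postalcode = l1.postalcode;

-- Re-run the join on your chosen schedule (nightly, every three hours, ...).
REFRESH MATERIALIZED VIEW expanded_id_tin_map;
```

Normal operations then read from expanded_id_tin_map, which behaves like an ordinary (indexable) table between refreshes.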

Postgis DB Structure: Identical tables for multiple years?

I have multiple datasets for different years as a shapefile and converted them to postgres tables, which confronts me with the following situation:
I have tables boris2018, boris2017, boris2016, and so on.
They all share an identical schema; for now let's focus on the following columns (the example is one row out of the boris2018 table). The rows represent actual PostGIS geometries with certain properties.
brw | brwznr | gema | entw | nuta
-----+--------+---------+------+------
290 | 285034 | Sieglar | B | W
The 'brwznr' column is an ID of some kind, but it does not seem to be entirely consistent across all years for each geometry.
Then again, most of the tables contain duplicate information. The geometry should be identical in every year, although this is not guaranteed either.
What I first did was to match the brwznr of each year with the 2018 data, adding brw17, brw16, ... columns to my boris2018 data, like so:
brw18 | brw17 | brw16 | brwznr | gema | entw | nuta
-------+-------+-------+--------+---------+------+------
290 | 260 | 250 | 285034 | Sieglar | B | W
This led to some data getting lost (because no matching brwznr was found), some data being wrongly matched (due to inconsistencies in the data), and it didn't feel right.
What I actually want to achieve is fast queries that get me the different brw values for a certain coordinate, something along the lines of
SELECT ortst, brw, gema, gena
FROM boris2018, boris2017, boris2016
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
or
SELECT ortst, brw18, brw17, brw16, gema, gena
FROM boris
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
although this is obviously wrong/has its deficits.
Since I am new to databases in general, I can't really tell whether this is a querying problem or a database structure problem.
I hope anyone can help, your time and effort is highly appreciated!
Tim
Have you tried using a CTE?
WITH j AS (
  SELECT ortst, brw, gema, gena, geom FROM boris2016
  UNION
  SELECT ortst, brw, gema, gena, geom FROM boris2017
  UNION
  SELECT ortst, brw, gema, gena, geom FROM boris2018)
SELECT * FROM j
WHERE ST_Intersects(j.geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));
Depending on your needs, you might want to use UNION ALL. Note that this approach might not be the fastest when dealing with very large tables. If that is the case, consider merging the results of these three queries into another table and creating an index on the geom field. Let me know in the comments if that is the case.
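A sketch of the "merge into one table" variant; the column list is assumed from the question, with a year tag added so rows from different source tables stay distinguishable:

```sql
-- One combined table instead of three per-year tables.
CREATE TABLE boris AS
  SELECT 2016 AS year, ortst, brw, gema, gena, geom FROM boris2016
  UNION ALL
  SELECT 2017, ortst, brw, gema, gena, geom FROM boris2017
  UNION ALL
  SELECT 2018, ortst, brw, gema, gena, geom FROM boris2018;

-- Spatial index so ST_Intersects point lookups probe the index
-- instead of scanning every geometry.
CREATE INDEX boris_geom_idx ON boris USING gist (geom);
```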

How to properly index strings for lookup and excepts, the PostgreSQL way

Due to infrastructure costs, I've been studying the possibility to migrate a few databases to PostgreSQL. So far I am loving it. But there are a few topics I am quite lost. I need some guidance on one of them.
I have an ETL process that queries "deltas" in my database and imports the new data. To do so, I use lookup tables that store hashes of some strings to facilitate the lookup. This works in SQL Server, but apparently things work quite differently in PostgreSQL. In SQL Server, using HASHBYTES + EXCEPT is suggested when working with millions of rows.
Let's suppose the following table:
+----+-------+------------------------------------------+
| Id | Name | hash_Name |
+----+-------+------------------------------------------+
| 1 | Mark | 31e9697d43a1a66f2e45db652019fb9a6216df22 |
| 2 | Pablo | ce7169ba6c7dea1ca07fdbff5bd508d4bb3e5832 |
| 3 | Mark | 31e9697d43a1a66f2e45db652019fb9a6216df22 |
+----+-------+------------------------------------------+
And my lookup table
+------------------------------------------+
| hash_Name |
+------------------------------------------+
| 31e9697d43a1a66f2e45db652019fb9a6216df22 |
+------------------------------------------+
When querying new data (Pablo's hash), I can start from the simplified query below:
SELECT hash_name
FROM mytable
EXCEPT
SELECT hash_name
FROM mylookup
Thinking the PostgreSQL way, how could I achieve this? Should I index and use EXCEPT? Or is there a better way of doing so?
From my research, I couldn't find much regarding storing hashes. Apparently, it is a matter of creating indexes and choosing the right index for the job, more precisely: B-tree for single-field indexes and GIN for multiple-field indexes.
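One common Postgres idiom for this kind of delta lookup is an indexed anti-join. A sketch using the tables above; NOT EXISTS typically plans as an efficient anti-join when the lookup column is indexed:

```sql
-- B-tree index so each probe into the lookup table is an index
-- lookup rather than a scan.
CREATE INDEX mylookup_hash_name_idx ON mylookup (hash_name);

-- Rows in mytable whose hash is not yet in the lookup table.
SELECT t.id, t.name, t.hash_name
FROM mytable t
WHERE NOT EXISTS (
  SELECT 1 FROM mylookup l
  WHERE l.hash_name = t.hash_name
);
```

EXCEPT also works, but the anti-join form lets you return the full rows (not just the hash) in one pass.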

Select distinct rows from MongoDB

How do you select distinct records in MongoDB? This is pretty basic db functionality, I believe, but I can't seem to find it anywhere else.
Suppose I have a table as follows
 Name   | Age
--------+-----
 John   | 12
 Ben    | 14
 Robert | 14
 Ron    | 12
I would like to run something like SELECT DISTINCT age FROM names WHERE 1;
db.names.distinct('age')
It looks like there is a SQL-to-MongoDB mapping chart that I overlooked earlier.
Now is a good time to say that using a distinct selection isn't the best way to query things. Either cache the list in another collection or keep your data set small.
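For reference, the same distinct list can also be produced with the aggregation pipeline, which additionally allows filtering or sorting in the same query. A sketch in the mongo shell, assuming the names collection from the question:

```javascript
// Group on the age field; each distinct age becomes one _id.
db.names.aggregate([
  { $group: { _id: "$age" } }
])
```

Unlike distinct(), the aggregation form streams results as a cursor, which matters if the distinct value set is large.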