How to group by similar values with pg_trgm - postgresql

I have the following table
 id | error
----+------------------------------------------
  1 | Error 1234eee5, can not write to disk
  2 | Error 83457qwe, can not write to disk
  3 | Error 72344ee, can not write to disk
  4 | Fatal barier breach on object 72fgsff
  5 | Fatal barier breach on object 7fasdfa
  6 | Fatal barier breach on object 73456xcc5
I want to get a result that counts by similarity, where a similarity of > 80% means two errors are considered equal. I've been using the pg_trgm extension, and its similarity function works well for me; the only thing I can't figure out is how to produce the grouping result below.
Error                                 | Count
--------------------------------------+-------
Error 1234eee5, can not write to disk |     3
Fatal barier breach on object 72fgsff |     3

Basically you could join the table with itself to find similar strings, but this approach would result in a terribly slow query on a larger dataset. Also, using similarity() may cause inaccuracies in some cases (you need to find an appropriate limit value).
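For illustration only, a minimal sketch of such a self-join (assuming the errors table from the question and the pg_trgm extension already installed); it compares every pair of rows, which is why it scales badly:

-- for each error, count how many rows in the table are similar to it
select e1.id, e1.error, count(*) as similar_rows
from errors e1
join errors e2 on similarity(e1.error, e2.error) > 0.8
group by e1.id, e1.error
order by e1.id;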
You should try to find patterns. For example, if all variable words in strings begin with a digit, you can mask them using regexp_replace():
select id, regexp_replace(error, '\d\w+', 'xxxxx') as error
from errors;
id | error
----+-------------------------------------
1 | Error xxxxx, can not write to disk
2 | Error xxxxx, can not write to disk
3 | Error xxxxx, can not write to disk
4 | Fatal barier breach on object xxxxx
5 | Fatal barier breach on object xxxxx
6 | Fatal barier breach on object xxxxx
(6 rows)
so you can easily group the data by error message:
select regexp_replace(error, '\d\w+', 'xxxxx') as error, count(*)
from errors
group by 1;
error | count
-------------------------------------+-------
Error xxxxx, can not write to disk | 3
Fatal barier breach on object xxxxx | 3
(2 rows)
The above query is only an example as the specific solution depends on the data format.
Using pg_trgm
The solution below is based on the OP's idea (see the comments below). The limit 0.8 for similarity() is certainly too high; it seems it should be somewhere around 0.6.
The table for unique errors (I've used a temporary table, but it could of course also be a regular one):
create temp table if not exists unique_errors(
    id serial primary key,
    error text,
    ids int[]);
The ids column stores the ids of the rows of the base table that contain similar errors.
do $$
declare
    e record;
    found_id int;
begin
    truncate unique_errors;
    for e in select * from errors loop
        select min(id)
        into found_id
        from unique_errors u
        where similarity(u.error, e.error) > 0.6;

        if found_id is not null then
            update unique_errors
            set ids = ids || e.id
            where id = found_id;
        else
            insert into unique_errors (error, ids)
            values (e.error, array[e.id]);
        end if;
    end loop;
end $$;
The final results:
select *, cardinality(ids) as count
from unique_errors;
id | error | ids | count
----+---------------------------------------+---------+-------
1 | Error 1234eee5, can not write to disk | {1,2,3} | 3
2 | Fatal barier breach on object 72fgsff | {4,5,6} | 3
(2 rows)

For this particular case you could just group by left(error, 5), which would lead to two groups: one containing all the strings starting with Error, the other containing all the strings starting with Fatal. This criterion would have to be updated if you plan to add more error types.
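A minimal sketch of that grouping, assuming the errors table from the question:

select left(error, 5) as error_type, count(*)
from errors
group by 1;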

Related

PostgreSQL 13 - Performance Improvement to delete large table data

I am using PostgreSQL 13 and have intermediate-level experience with PostgreSQL.
I have a table named tbl_employee. It stores employee details for a number of customers.
Below is my table structure, followed by datatype and index access method
 Column      | Data Type         | Index name      | Idx Access Type
-------------+-------------------+-----------------+-----------------
 id          | bigint            |                 |
 name        | character varying |                 |
 customer_id | bigint            | idx_customer_id | btree
 is_active   | boolean           | idx_is_active   | btree
 is_delete   | boolean           | idx_is_delete   | btree
I want to delete employees for specific customer by customer_id.
The table has more than 1,800,000 records in total.
When I execute the query below for customer_id 1001, it returns a count of 85,000.
SELECT COUNT(*) FROM tbl_employee WHERE customer_id=1001;
When I perform the delete operation for this customer using the query below, it takes 2 hours 45 minutes to delete the records.
DELETE FROM tbl_employee WHERE customer_id=1001
Problem
My concern is that this query should take less than a minute to delete the records. Is it normal for it to take such a long time, or is there any way to optimise it and reduce the execution time?
Below is the EXPLAIN output of the delete query.
The values are seq_page_cost = 1 and random_page_cost = 4.
Below is the number of pages occupied by the table "tbl_employee" according to pg_class.
Please guide. Thanks.
While
DELETE FROM tbl_employee WHERE customer_id=1001
is running, is there any other operation accessing this table? If this is the only SQL statement accessing the table, I wouldn't expect it to take that much time.
In an RDBMS, each SQL statement is also a transaction, unless it's wrapped in BEGIN; and COMMIT; to form a multi-statement transaction.
It's possible your multi-row DELETE statement is generating a very large transaction that's forcing PostgreSQL to thrash, i.e. to spill its transaction logs from RAM to disk.
You can try repeating this statement until you've deleted all the rows you need to delete:
DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
Doing it this way will keep your transactions smaller, and may avoid the thrashing.
The statement
DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
will not work, though, because PostgreSQL's DELETE does not support LIMIT.
To keep the batches small, you can try this instead, repeating it until there is nothing left to delete:
DELETE FROM tbl_employee
WHERE ctid IN (SELECT ctid FROM tbl_employee WHERE customer_id=1001 LIMIT 1000);
Here ctid is a system column of PostgreSQL tables; it identifies a row's physical location.

PostgreSQL: on database restart, why is starting sequence number unpredictable?

OS: macOS 11.4 (Big Sur)
PostgreSQL: 13.4
I would expect the default behavior of sequence numbers (that is, auto-generated sequences used typically for PK generation on record-inserts) to be straightforward on server re-starts: namely, that sequence numbers always "start where they left off". If the last record inserted had an auto-sequenced ID of 5, then the next record-insert should have ID of 6. And so on.
But recently, more than once, I have observed less than desirable default behavior for sequence numbers. Here are two different observations, both presumably resulting from the same suspect behavior after database server re-starts:
Let's suppose the record in your table with ID 1 was deleted, but records with IDs 2-5 exist. Then after a server restart, the sequence started at 1 again. The first insert works (that is, a record with ID 1 is successfully inserted), but the next few inserts result in PK-duplicate exceptions. Once the sequence reaches 6, inserts start working again.
Again, suppose records exist in your table for IDs 2-5. Then after a server restart, the sequence starts at some larger number, like 35. In this case, a large swath of IDs between 5 and 35 (exclusive) is unused, making it seem as if records with those IDs had been deleted.
This certainly seems awkward behavior. Is there some way to set up sequence numbers to avoid this behavior?
Sample sequence number from my database:
mydb=# \dS+ birthday_id_seq
Sequence "public.birthday_id_seq"
Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
--------+-------+---------+---------------------+-----------+---------+-------
bigint | 1 | 1 | 9223372036854775807 | 1 | no | 1
mydb=# \dS+ birthdays
Table "public.birthdays"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------------+-----------------------------+-----------+----------+--------------------------------------+----------+--------------+-------------
id | bigint | | not null | nextval('birthday_id_seq'::regclass) | plain | |
birthdate | date | | | | plain | |
Indexes:
"birthdays_pkey" PRIMARY KEY, btree (id)
Access method: heap
mydb=# \d+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+---------------------+----------+-------------+-------------+------------+-------------
public | birthday_id_seq | sequence | kodecharlie | permanent | 8192 bytes |
(1 row)
That is normal behavior:
Any sequence values that were already fetched by nextval, but never used in an INSERT that got committed, will be lost. That could happen if you perform a fast (or an immediate) shutdown while the INSERT was taking place.
Moreover, the first time you run nextval, PostgreSQL logs a WAL entry that consumes the next 32 values, so that it doesn't have to log each individual nextval. These values are lost after a restart.
As for the sequence going backwards after a restart:
Sequences, like all other objects, are WAL logged. WAL is guaranteed to be flushed during commit. Now if you start a transaction, fetch a sequence value and perform an insert, but don't commit the transaction yet, the changes to the sequence may still be in WAL buffers and not flushed to disk.
A crash that interrupts the transaction will cause the sequence to be reset to the last committed value, so you may get the same sequence number again. That is fine, because any sequence values fetched from the sequence since have not been committed either.
Which of the two behaviors you see depends on concurrent transactions: Typically, you will see missing values after a restart. But if you start a transaction, call nextval and crash the database without committing, you may see the same sequence value again after a restart.
You may want to read my article for more details.
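Nothing needs fixing here, but if you want to inspect a sequence after a restart, or re-align it with the table in the duplicate-key scenario above, a minimal sketch (using the birthdays table and birthday_id_seq from the question, and assuming the table is not empty):

-- inspect the sequence's current position
SELECT last_value, is_called FROM birthday_id_seq;

-- re-align the sequence with the highest id already present in the table
SELECT setval('birthday_id_seq', (SELECT max(id) FROM birthdays));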

Why does using DISTINCT ON () at different points in a query return different (unintuitive) results?

I’m querying from a table that has repeated uuids, and I want to remove duplicates. I also want to exclude some irrelevant data which requires joining on another table. I can remove duplicates and then exclude irrelevant data, or I can switch the order and exclude then remove duplicates. Intuitively, I feel like if anything, removing duplicates then joining should produce more rows than joining and then removing duplicates, but that is the opposite of what I’m seeing. What am I missing here?
In this one, I remove duplicates in the first subquery and filter in the second, and I get 500k rows:
with tbl1 as (
select distinct on (uuid) uuid, foreign_key
from original_data
where date > some_date
),
tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
)
select * from tbl2
If I filter then remove duplicates, I get 550k rows:
with tbl1 as (
select uuid, foreign_key
from original_data
where date > some_date
),
tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
),
tbl3 as (
select distinct on (uuid) uuid
from tbl2
)
select * from tbl3
Is there an explanation here?
Does original_data.foreign_key have a foreign key constraint referencing other_data.id, or does it allow foreign_keys that don't link to any id in other_data?
Is the other_data.category or original_data.foreign_key column missing a NOT NULL constraint?
In either of these cases Postgres would filter out, in both of your approaches, all records with
- a missing link (foreign_key is null),
- a broken link (foreign_key doesn't match any id in other_data), or
- a link to an other_data record whose category is null,
regardless of whether they're duplicates or not, because other_data.category <> something evaluates to null for them, which does not satisfy the WHERE clause. That, combined with the missing ORDER BY causing DISTINCT ON to drop different duplicates each time, could result in dropping duplicates that then get filtered out in tbl2 in the first approach but not in the second.
Example:
pgsql122=# select * from original_data;
uuid | foreign_key | comment
------+-------------+---------------------------------------------------
1 | 1 | correct, non-duplicate record with a correct link
3 | 2 | duplicate record with a broken link
3 | 1 | duplicate record with a correct link
4 | null | duplicate record with a missing link
4 | 1 | duplicate record with a correct link
5 | 3 | duplicate record with a correct link, but a null category behind it
5 | 1 | duplicate record with a correct link
6 | null | correct, non-duplicate record with a missing link
7 | 2 | correct, non-duplicate record with a broken link
8 | 3 | correct, non-duplicate record with a correct link, but a null category behind it
pgsql122=# select * from other_data;
id | category
----+----------
1 | a
3 | null
Both of your approaches keep uuid 1 and eliminate uuid 6, 7 and 8 even though they're unique.
Your first approach randomly keeps between 0 and 3 out of the 3 pairs of duplicates (uuid 3, 4 and 5), depending on which one in each pair gets discarded by DISTINCT ON.
Your second approach always keeps one record for each of uuid 3, 4 and 5. Every clone with a missing link, a broken link, or a link with a null category behind it is already gone by the time you discard duplicates.
As @a_horse_with_no_name suggested, ORDER BY would make DISTINCT ON consistent and predictable, but only as long as records vary in the columns used for ordering. It also won't help if you have the other issues described above.
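A minimal sketch of that, reusing the first subquery from the question (the date desc tie-breaker is just an example of making the choice deterministic):

select distinct on (uuid) uuid, foreign_key
from original_data
where date > some_date
order by uuid, date desc;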

Keyword search using PostgreSQL

I am trying to identify observations in my data using a list of keywords. However, the search results contain observations where only part of the keyword matches: for instance, the keyword ice returns varices. I am using the following code:
select *
from mytab
WHERE myvar similar to '%((ice)|(cool))%';
I tried to_tsquery and it does an exact match and does not include the observations with varices, but this approach takes significantly longer to query (a 2-keyword search with similar to '% %' takes 5 secs, whereas to_tsquery takes 30 secs for a 1-keyword search, and I have more than 900 keywords to search).
select *
from mytab
where myvar @@ to_tsquery('ice');
Is there a way to query multiple keywords using to_tsquery, and is there any way to speed up the querying process?
I'd suggest using keywords in a relational sense rather than keeping a running list of them under one field, which makes for terrible performance. Instead, you can have a table of keywords with ids as primary keys and foreign keys referring to mytab's primary keys. So you'd end up with the following:
keywords table

 id | mytab_id | keyword
----+----------+---------
  1 |        1 | liver
  2 |        1 | disease
  3 |        1 | varices
  4 |        2 | ice

mytab table

 id | rest of fields
----+----------------
  1 | ....
  2 | ....
You can then do an inner join to find what keywords belong to the specified entries in mytab:
SELECT * FROM mytab
JOIN keywords ON keywords.mytab_id = mytab.id
WHERE keyword = 'ice'
You could also add a constraint to make sure the keyword and mytab_id pair is unique, that way you don't accidentally end up with the same keyword for the same entry in mytab.
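A minimal sketch of that constraint, and of searching several keywords at once (the constraint name is just an example):

ALTER TABLE keywords
    ADD CONSTRAINT keywords_mytab_id_keyword_key UNIQUE (mytab_id, keyword);

-- find mytab entries tagged with any of the listed keywords
SELECT DISTINCT mytab.*
FROM mytab
JOIN keywords ON keywords.mytab_id = mytab.id
WHERE keywords.keyword IN ('ice', 'cool');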

Jmeter Issues with JDBC Request and variables

I'm having a few issues with JMeter and storing/using variables from JDBC requests:
I have a JDBC request which does a VERY simple select statement with the following SQL:
select count(member_id) from member
The result is stored in a variable named count. I know what the count should be (it should be 312), but the value that count_1 gets is 40077. What is even more troubling is that at some point it started working and returning the correct count. Any idea what is going on?
In a separate JDBC request, I retrieve a list of members:
select member_id from members
This is stored in a variable named members. Then I created a THIRD JDBC request to query and grab a random member:
select * from members where member_id = ?
In "Parameter values", I put in ${__V(member_${__Random(1,10)})} (note I put 10, not $count because I can't even get it to work correctly with a hard coded number). I see that this gets parsed correctly, but the error I get is:
org.postgresql.util.PSQLException: ERROR: invalid input syntax for integer: "member_7"
So it's not substituting the member_7 variable's value. Instead it's just passing the string. What am I doing wrong here?
If you have a table member with some member_id values like this (for example):
| member_id |
+-----------+
|         1 |
|         2 |
|         1 |
and you would like to count the UNIQUE members in this table, you must write the SELECT this way:
SELECT COUNT(DISTINCT member_id) FROM member;
If you omit the DISTINCT keyword, you will get only a count of all the rows in the table.
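With the three sample rows above, the difference looks like this:

SELECT COUNT(member_id) FROM member;           -- returns 3, one per row
SELECT COUNT(DISTINCT member_id) FROM member;  -- returns 2, the unique member_ids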
The second SELECT should be written in a similar way:
SELECT DISTINCT member_id FROM member;
And the last question: why are you trying to assign a string value like 'member_7' to an integer parameter?