Will huge table entries slow down query performance? - postgresql

Let's say I have a table persons that looks like this:
|id | name | age |
|---|------|-----|
|1 |foo |21 |
|2 |bar |22 |
|3 |baz |23 |
and add a new column history where I store a big JSON blob of, let's say ~4MB.
|id | name | age | history |
|---|------|-----|----------|
|1 |foo |21 |JSON ~ 4MB|
|2 |bar |22 |JSON ~ 4MB|
|3 |baz |23 |JSON ~ 4MB|
Will this negatively impact queries against this table overall?
What about queries like:
SELECT name FROM persons WHERE ... (Guess: This won't impact performance)
SELECT * FROM persons WHERE ... (Guess: This will impact performance as the database needs to read and send the big history entry)
Are there any other side effects like various growing caches etc. that could slow down database performance overall?

The JSON attribute will not be stored in the table itself, but in the TOAST table that belongs to the table, which is where all variable-length entries above a certain size are stored (and compressed).
Queries that do not read the JSON values won't be affected at all, since the TOAST entries won't even be touched. Performance will only suffer if you read the JSON value, mostly because of the additional data that has to be read from storage and transmitted to the client; of course, that additional data will also end up in the database cache and compete with other data there.
So your guess is right.
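If you want to see how much of the table's data has been moved out of line, you can compare the size of the main table with the size of its TOAST table (a minimal sketch, assuming the table is named persons as above):
-- main relation vs. its TOAST relation, human-readable
SELECT pg_size_pretty(pg_relation_size(c.oid))           AS main_table_size,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_table_size
FROM pg_class c
WHERE c.relname = 'persons';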

Depending on how many transactions use this table and what kind they are (create, read, update, delete), there could be performance issues.
If you update the history column a lot, each UPDATE writes a new row version and usually new index entries as well, so the table and its indexes will bloat and need more vacuuming.
Say the persons table is read every time a user logs in and the login also updates that user's history: you are doing a SELECT plus an UPDATE each time. If that happens frequently, the constant rewriting of those large rows can slow down logins while other users are also updating their history.
A better option would be a separate table for the person history/updates with a relation to the persons table, as sketched below.
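A minimal sketch of that layout (the table and column names here are placeholders, not from the original):
-- keep the large, frequently updated history out of the main persons table
CREATE TABLE person_history (
    person_id integer PRIMARY KEY REFERENCES persons (id),
    history   jsonb NOT NULL
);
-- queries against persons never touch the big value; fetch it only when needed
SELECT h.history
FROM person_history h
WHERE h.person_id = 1;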

Related

Postgresql Prune replicated data from outbox table

Problem Statement
To ensure disk usage isn't growing unnecessarily, I want to be able to delete rows from my outbox table once they have been replicated.
Context
Postgres is at v12
We are using a Kafka source connector to stream changes made to a Postgres table. These changes are insert-only and thus are no longer needed once they have been written to Kafka. The connector uses logical replication to receive the changes, and the state of the replication can be seen in pg_replication_slots.
When looking at pg_replication_slots, you can see the useful data the slot stores in order to know which WAL it has to keep so that replication can still happen for the client.
For example when I run:
select * from pg_replication_slots;
I might see:
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+--------------------+-----------+--------+------------+------+--------------+-------------+---------------------
debezium | wal2json | logical | 26593 | database_name | f | t | 7404 | | 26729 | 0/DCD98E8 | 0/DCD9920
(1 row)
What I'm interested in knowing is whether I can reliably use that data, together with the PostgreSQL metadata on the table, to select all rows that have already been replicated through that slot.
For example, this doesn't work as far as I can tell, but ideally would return rows that have been replicated and are now safe to prune from the table:
select * from outbox where age(xmin) < (select age(catalog_xmin) from pg_replication_slots);
Any guidance would be sweet! Cheers!
I have been implementing the Outbox pattern using Debezium with MySQL, and I delete the outbox record straight after inserting it, as I saw done here: https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/ The insert is picked up and sent, while the delete is ignored, so essentially there should never be anything in the outbox table (outside of the transaction).
I also pre-generate the primary keys for the entries (which I use for the event ID in Kafka) so I can bulk insert and delete.
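A minimal sketch of that insert-then-delete flow in PostgreSQL (the outbox columns and the pre-generated UUID are illustrative assumptions, not from the original):
BEGIN;
-- the key is generated by the application up front, so the same value can be
-- used as the Kafka event ID and for the immediate delete
INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
VALUES ('6f1d2c3e-9a4b-4c1d-8e2f-0a1b2c3d4e5f', 'order', '42', 'OrderCreated', '{"total": 99.50}');
-- logical decoding still emits the INSERT; the connector is configured to ignore the DELETE
DELETE FROM outbox WHERE id = '6f1d2c3e-9a4b-4c1d-8e2f-0a1b2c3d4e5f';
COMMIT;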
Circling back around to this, I had to think a bit differently about how we could tie the replication's progress to our outbox table. Previously, in my question, I was trying to glean progress from pg_replication_slots, but in this working example I switched to using pg_stat_replication. This view can be queried by the slot_name we care about and returns lag figures. For example:
SELECT *
FROM outbox
WHERE created_at < (SELECT NOW() - COALESCE(replay_lag, interval '60 seconds') AS stale_time
                    FROM pg_stat_replication
                    WHERE pg_stat_replication.slot_name = 'outbox_slot');
This returns the rows from our outbox table that were inserted longer ago than the current replay_lag (or 60 seconds when replay_lag is NULL).
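To actually prune, the same predicate can drive a DELETE (a sketch that simply reuses the query above; verify it against your own replication setup before relying on it):
DELETE FROM outbox
WHERE created_at < (SELECT NOW() - COALESCE(replay_lag, interval '60 seconds')
                    FROM pg_stat_replication
                    WHERE pg_stat_replication.slot_name = 'outbox_slot');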

Compare columns in two tables and merge values in PostgreSQL

I am given two tables that share some column names and have somewhat similar rows. I need a method to compare the row entries and copy the value of one column from one table to the other for the matching rows. An example that describes my problem looks like the following:
big table                     small table
customer | address            customer       | address
---------+-----------         ---------------+-----------
John     | 123 Road           customer John  | 123 Road
Jason    | 234 Dr             shopper Jack   | 645 Pkway
Jack     | NULL               customer Jones | 789 Road
Jones    | NULL
The small table can be viewed as a data source for the big table; they contain the same column names. I have tried joins, but they don't quite fit: the different naming conventions are causing problems. In this case, the customer names in the big table are contained in the customer names in the small table. What I would like to achieve in the above example is that the addresses of "shopper Jack" and "customer Jones" in the small table are copied into the address column for Jack and Jones in the big table.
I hope my description is clear enough. Thank you.
Use UPDATE ... FROM with position() to check whether the big-table customer name is contained in the small-table customer name.
update big_table bt
set address = sm.address
from small_table sm
where bt.address is null
and position(bt.customer in sm.customer) > 0;
But be extremely cautious: this is not good practice. It would be much better to split the small table's customer column into 2 columns, as sketched below. You also need to address what happens when both tables have an address for a customer but they differ, and what to do about multiple customers with the same name.
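A rough sketch of that split, assuming every small-table customer value is "<type> <name>" with a single space (an assumption for illustration only):
-- add separate columns and fill them from the combined value
ALTER TABLE small_table
    ADD COLUMN customer_type text,
    ADD COLUMN customer_name text;
UPDATE small_table
SET customer_type = split_part(customer, ' ', 1),
    customer_name = split_part(customer, ' ', 2);
-- the merge then becomes an exact-match join instead of a substring test
UPDATE big_table bt
SET address = sm.address
FROM small_table sm
WHERE bt.address IS NULL
  AND bt.customer = sm.customer_name;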

How to properly index strings for lookup and excepts, the PostgreSQL way

Due to infrastructure costs, I've been studying the possibility of migrating a few databases to PostgreSQL. So far I am loving it, but there are a few topics on which I am quite lost. I need some guidance on one of them.
I have an ETL process that queries "deltas" in my database and imports the new data. To do so, I use lookup tables that store hashbytes of some strings to facilitate the lookup. This works in SQL Server, but apparently things work quite differently in PostgreSQL. In SQL Server, using hashbytes + except is suggested when working with millions of rows.
Let's suppose the following table
+----+-------+------------------------------------------+
| Id | Name | hash_Name |
+----+-------+------------------------------------------+
| 1 | Mark | 31e9697d43a1a66f2e45db652019fb9a6216df22 |
| 2 | Pablo | ce7169ba6c7dea1ca07fdbff5bd508d4bb3e5832 |
| 3 | Mark | 31e9697d43a1a66f2e45db652019fb9a6216df22 |
+----+-------+------------------------------------------+
And my lookup table
+------------------------------------------+
| hash_Name |
+------------------------------------------+
| 31e9697d43a1a66f2e45db652019fb9a6216df22 |
+------------------------------------------+
When querying for new data (Pablo's hash), I can start from the simplified query below:
SELECT hash_name
FROM mytable
EXCEPT
SELECT hash_name
FROM mylookup
Thinking the PostgreSQL way, how could I achieve this? Should I index and use EXCEPT? Or is there a better way of doing so?
From my research, I couldn't find much specifically about storing hash bytes. Apparently, it is simply a matter of creating indexes and choosing the right index type for the job: a B-tree index for plain equality and range lookups on a column (which is what this hash comparison needs), and GIN for composite values such as arrays, jsonb, or full-text search.
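A minimal sketch of what that can look like here (the index names are just for illustration; an anti-join is shown as an alternative to EXCEPT):
-- plain B-tree indexes on the hash columns
CREATE INDEX mytable_hash_name_idx  ON mytable  (hash_name);
CREATE INDEX mylookup_hash_name_idx ON mylookup (hash_name);
-- EXCEPT works, but an anti-join states the intent directly and uses the index well
SELECT t.hash_name
FROM mytable t
WHERE NOT EXISTS (
    SELECT 1
    FROM mylookup l
    WHERE l.hash_name = t.hash_name
);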

How to show error message using trigger in SAP HANA

I'm trying to create a trigger so that whenever I insert a new record into the Sales table, the Product table updates its “Inventory” based on the Sales table's “quantity”:
Product table          Sales table
P_ID | QTY             P_ID | QTY
-----+-----            -----+-----
1    | 10              1    | 5
2    | 15
Code:
create trigger "KABIL_PRACTICE"."SALES_TRIGGER"
after insert on "KABIL_PRACTICE"."SALES" REFERENCING NEW ROW AS newrow for each row
begin
update "KABIL_PRACTICE"."Inventory" set "Inventory" = "Inventory" - :newrow.QTY
where "P_ID" = :newrow.P_ID ;
end;
I get the expected result when I insert a record into the Sales table with P_ID 1 and quantity 5:
updated Product table      Sales table
P_ID | QTY                 P_ID | QTY
-----+-----                -----+-----
1    | 5                   1    | 5
2    | 15                  1    | 5
But if I insert another record into the Sales table with P_ID 1 and quantity 6, the sales quantity is more than the available inventory quantity, which means the inventory goes to a negative value...
updated Product table      Sales table
P_ID | QTY                 P_ID | QTY
-----+-----                -----+-----
1    | -1                  1    | 5
2    | 15                  1    | 5
                           1    | 6
I just want to report when the sales order quantity is higher than the available inventory quantity, so that the inventory does not go to negative values... is there any way to do this?
I tried this code:
create trigger "KABIL_PRACTICE"."SALES_UPDATE_TRIGGER"
before insert on "KABIL_PRACTICE"."SALES" REFERENCING NEW ROW AS newrow for each row
begin
if("Inventory" > :newrow.QTY )
Then
update "KABIL_PRACTICE"."Inventory" set "Inventory" = "Inventory" - :newrow.QTY
where "P_ID" = :newrow.P_ID ;
elseif ("Inventory" < :newrow.QTY )
Then NULL;
delete "KABIL_PRACTICE"."SALES" where "QTY" = 0;
end;
The problem you have here is a classic. Usually the two business processes "SALES" and "ORDER FULFILLMENT" are separated, so the act of selling something would not have an immediate effect on the stock level. Instead, the order fulfilment could actually use other resources (e.g. back ordering from another vendor or producing more). That way the sale would be de-coupled from the current stock levels.
Anyhow, if you want to keep it a simple dependency of "only-sell-whats-available-right-now" then you need to consider the following:
multiple sales could be going on at the same time
what to do with sales that can only be partly fulfilled, e.g. should all available items be sold, or should the whole order be treated as unfulfillable?
To address the first point, again, different approaches can be taken. The easiest probably is to set a lock on the inventory records you are interested in as long as you make the decision(s) whether to process the order (and inventory transaction) or not.
SELECT QTY "KABIL_PRACTICE"."Inventory" WHERE P_ID = :P_ID FOR UPDATE;
This statement will acquire a lock on the relevant row(s) and return, or wait until the lock becomes available if another session already holds it.
Once the quantity of an item is retrieved, you can call the further business logic (fulfil order completely, partly or decline).
Each of these application paths could be a stored procedure grouping the necessary steps.
By COMMITing the transaction the lock will get released.
As a general remark: this should not be implemented as triggers. Triggers should generally not be involved in application paths that could lead to locking situations in order to avoid system hang situations. Also, triggers don't really allow for a good understanding of the order in which statements get executed, which easily can lead to unwanted side effects.
Rather than triggers, stored procedures can provide an interface for applications to work with your data in a meaningful and safe way.
E.g.
procedure ProcessOrder
    for each item in the order:
        check stock and lock the entry
        (depending on business logic):
            subtract all available items from stock to match the order as much as possible
            OR: only fulfil order items that can be fully provided and mark the other items as not available; reduce the sales order SUM
            OR: decline the whole order
    COMMIT;
Your application can then simply call the procedure and retrieve the outcome data (e.g. current order data, status, ...) without having to worry about how the tables need to be updated.

How to merge table cells when adjacent cell empty?

I'm looking for a way to merge cells, but only when a condition is true. Other suggestions for my problem are fine, too.
Background: I need to create a Jasper report for which I got a design/layout specification. All data is provided through a single stored procedure.
The layout is mostly a simple table with data, but some rows differ from the rest and contain some sort of interim report data (which is not calculated from the previous values). Those rows also differ in the row layout. The number of rows before and after is dynamic.
Example:
------------------------------------
| data | data | data | data | data |
------------------------------------
| data | data | data | data | data |
------------------------------------
| data | data | data | data | data |
------------------------------------
| some text | abc | def |
------------------------------------
| data | data | data | data | data |
------------------------------------
| different text | xyz |
------------------------------------
The procedure delivers all of this data in a single data set, including the text of those special rows. For cells that should be merged with their left adjacent cell, the procedure returns NULL; all other cells always contain some sort of data.
Now I could use some help to actually merge those cells. If there are other/better ways to achieve the given layout, feel free to suggest them.
Unfortunately I have no control over the stored procedure, but slight alterations might be possible.