I am given two tables that share some column names and somewhat similar rows. I need a method to compare the row entries and, for the matching rows, insert the value of one column from one table into the other. An example that describes my problem looks like the following:
big table

customer | address
---------+----------
John     | 123 Road
Jason    | 234 Dr
Jack     | Null
Jones    | Null

small table

customer       | address
---------------+-----------
customer John  | 123 Road
shopper Jack   | 645 Pkway
customer Jones | 789 Road
The small table can be viewed as a data source for the big table; they contain the same column names. I have tried join methods, but they don't quite fit: the different naming conventions cause problems. In this case, each customer name in the big table is contained in a customer name in the small table. What I would like to achieve in the above example is that the addresses of "shopper Jack" and "customer Jones" in the small table are inserted as the addresses for Jack and Jones in the big table.
I hope my description is clear enough. Thank you.
Use UPDATE ... FROM, matching on the position of the big-table customer name within the small-table customer name:
update big_table bt
set address = sm.address
from small_table sm
where bt.address is null
and position(bt.customer in sm.customer) > 0;
But be extremely cautious: this is not good practice. It would be much better to split the small table's customer column into two columns. You also need to decide what happens when both tables have an address for a customer but the addresses differ, and what to do about multiple customers with the same name.
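For the splitting approach, a minimal sketch (assuming PostgreSQL, and that the small-table values are always one type word, a space, then the name):

```sql
-- Add separate type and name columns to the small table.
alter table small_table add column customer_type text;
alter table small_table add column customer_name text;

-- split_part is PostgreSQL-specific: first word = type, second word = name.
update small_table
set customer_type = split_part(customer, ' ', 1),
    customer_name = split_part(customer, ' ', 2);
```

After that, the address update becomes a plain equality join on bt.customer = sm.customer_name instead of a position() match.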
I have a table that lists two connected values, ID and TaxNumber (TIN), that looks somewhat like this:
IDTINMap

ID         | TIN
-----------+--------
1234567890 | 654321
3456321467 | 986321
8764932312 | 245234
An ID can map to multiple TINs, and a TIN might map to multiple IDs, but there is a Unique constraint on the table for an ID, TIN pair.
This list isn't complete, and the table has about 8000 rows. I have another table, IDListing that contains metadata for about 9 million IDs including name, address, city, state, postalcode, and the ID.
What I'm trying to do is build an expanded ID - TIN map. Currently I'm doing this by first joining the IDTINMap table with IDListing on the ID field, which gives something that looks like this in a CTE that I'll call Step1 right now:
ID         | TIN    | Name      | Address        | City          | State | Zip
-----------+--------+-----------+----------------+---------------+-------+------
1234567890 | 654321 | John Doe  | 123 Easy St    | Seattle       | WA    | 65432
3456321467 | 986321 | Tyler Toe | 874 W 84th Ave | New York      | NY    | 48392
8764932312 | 245234 | Jane Poe  | 984 Oak Street | San Francisco | CA    | 12345
Then I join the IDListing table again, this time against Step1, on address, city, state, zip, and name all being equal. I know I could do something more complicated like fuzzy matching, but for right now we're just looking at exact matches. In the join I preserve the ID from Step1 as 'ReferenceID', keep the TIN, and then have another column for all the matching IDs. I don't keep any of the address/city/state/zip info, just the three numbers.
Then I can go back and insert all the distinct pairs into a final table.
I've tried this with a query and it works and gives me the desired result. However, the query is slower than desired. I'm used to joining on columns that I've indexed (like ID or TIN), but it's slow to join on all of the address fields. Is there a good way to improve this? Joining on each field individually is faster than joining on a CONCAT() of all the fields (I have tried this). I'm just wondering if there is another way I can optimize it.
Make the final result a materialized view. Refresh it when you need to update the data (every night? every three hours?). Then use this view for your normal operations.
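A sketch of that setup, assuming the table and column names described in the question and the exact-match join it outlines:

```sql
create materialized view expanded_id_tin_map as
with step1 as (
    -- join the small map to the listing to pick up the address fields
    select m.id, m.tin, l.name, l.address, l.city, l.state, l.postalcode
    from idtinmap m
    join idlisting l on l.id = m.id
)
select s.id as reference_id, s.tin, l2.id as matched_id
from step1 s
join idlisting l2
  on  l2.name       = s.name
  and l2.address    = s.address
  and l2.city       = s.city
  and l2.state      = s.state
  and l2.postalcode = s.postalcode;

-- run on whatever schedule fits, e.g. from a nightly job:
refresh materialized view expanded_id_tin_map;
```

The expensive multi-column join then only runs at refresh time, not on every lookup.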
I'm dealing with a lot of unique data that has the same type of columns, but each group of rows has different attributes, and I'm trying to see whether PostgreSQL has a way of storing metadata about groups of rows in a database, or whether I would be better off adding custom columns to my current list of columns to track these different attributes. Microsoft Excel, for instance, lets you merge multiple columns into a super-column to group them into one, but I don't know how this would translate to a PostgreSQL database. Thoughts, anyone?
Right, can't upload files. Hope this turns out well.
Section 1 | Section 2 | Section 3
=================================
Num1|Num2 | Num1|Num2 | Num1|Num2
=================================
132 | 163 | 334 | 1345| 343 | 433
......
......
......
To have a "super group" of columns (in SQL in general, not just PostgreSQL), the easiest approach is to use multiple tables.
Example:
Person table can have columns of
person_ID, first_name, last_name
employee table can have columns of
person_id, department, manager_person_id, salary
customer table can have columns of
person_id, addr, city, state, zip
That way, you can join them together to do whatever you like.
Example:
select *
from person p
left outer join customer c on c.person_id = p.person_id
left outer join employee e on e.person_id = p.person_id
Or any variation, while separating the data into different types and perhaps saving a little disk space in the process (for example, if most "people" are "customers", they don't need a bunch of employee data floating around or nullable columns).
That's how I normally handle this type of situation, but without a practical example, it's hard to say what's best in your scenario.
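As a sketch, the tables above could be declared like this (column types are illustrative):

```sql
create table person (
    person_id  integer primary key,
    first_name text,
    last_name  text
);

create table employee (
    person_id         integer primary key references person (person_id),
    department        text,
    manager_person_id integer references person (person_id),
    salary            numeric
);

create table customer (
    person_id integer primary key references person (person_id),
    addr      text,
    city      text,
    state     text,
    zip       text
);
```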
I have multiple datasets for different years as a shapefile and converted them to postgres tables, which confronts me with the following situation:
I got tables boris2018, boris2017, boris2016 and so on.
They all share an identical schema, for now let's focus on the following columns (example is one row out of the boris2018 table). The rows represent actual postgis geometry with certain properties.
brw | brwznr | gema | entw | nuta
-----+--------+---------+------+------
290 | 285034 | Sieglar | B | W
The 'brwznr' column is an ID of some kind, but it does not seem to be entirely consistent across all years for each geometry.
Then again, most of the tables contain duplicate information. The geometry should be identical in every year, although this is not guaranteed either.
What I first did was to match the brwznr of each year with the 2018 data, adding brw17, brw16, ... columns to my boris2018 data, like so:
brw18 | brw17 | brw16 | brwznr | gema | entw | nuta
-------+-------+-------+--------+---------+------+------
290 | 260 | 250 | 285034 | Sieglar | B | W
This led to some data getting lost (because there was no matching brwznr found), some data wrongly matched (because some matching was wrong due to inconsistencies in the data) and it didn't feel right.
What I actually want to achieve is having fast queries that get me the different brw values for a certain coordinate, something along the lines of
SELECT ortst, brw, gema, gena
FROM boris2018, boris2017, boris2016
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
or
SELECT ortst, brw18, brw17, brw16, gema, gena
FROM boris
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
although this is obviously wrong/has its deficits.
Since I am new to databases in general, I can't really tell whether this is a querying problem or a database structure problem.
I hope anyone can help, your time and effort is highly appreciated!
Tim
Have you tried using a CTE?
WITH j AS (
    SELECT ortst, brw, gema, gena, geom FROM boris2016
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2017
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2018)
SELECT * FROM j
WHERE ST_Intersects(j.geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));
Depending on your needs, you might wanna use UNION ALL. Note that this approach might not be the fastest one when dealing with very large tables. If that is the case, consider merging the result of these three queries into another table and creating an index on the geom field. Let me know in the comments if that is the case.
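A sketch of that merge (PostGIS), assuming the three tables share the columns used below; the year column records which source table each row came from:

```sql
create table boris as
select 2016 as year, ortst, brw, gema, gena, geom from boris2016
union all
select 2017, ortst, brw, gema, gena, geom from boris2017
union all
select 2018, ortst, brw, gema, gena, geom from boris2018;

-- a GiST index makes the ST_Intersects lookup fast
create index boris_geom_idx on boris using gist (geom);

select year, ortst, brw, gema, gena
from boris
where ST_Intersects(geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));
```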
I want to store full versioning of a row every time an update is made, for an amount-sensitive table.
So far, I have decided to use the following approach.
Do not allow updates.
Every time an update is made, create a new entry in the table.
However, I am undecided on what is the best database structure design for this change.
Current Structure
Primary Key: id
id(int) | amount(decimal) | other_columns
First Approach
Composite Primary Key: id, version
id(int) | version(int) | amount(decimal) | change_reason
--------+--------------+-----------------+---------------
1       | 1            | 100             |
1       | 2            | 20              | correction
Second Approach
Primary Key: id
Uniqueness Index on [origin_id, version]
id(int) | origin_id(int) | version(int) | amount(decimal) | change_reason
--------+----------------+--------------+-----------------+---------------
1       | NULL           | 1            | 100             | NULL
2       | 1              | 2            | 20              | correction
I would suggest a new table which stores a unique id for each item. This serves as a lookup table for all available items.
item Table:
id(int)
1000
For the table which stores all changes for an item, let's call it the item_changes table, item_id is a FOREIGN KEY to the item table's id. The relationship between the item table and the item_changes table is one-to-many.
item_changes Table:
id(int) | item_id(int) | version(int) | amount(decimal) | change_reason
--------+--------------+--------------+-----------------+---------------
1       | 1000         | 1            | 100             | NULL
2       | 1000         | 2            | 20              | correction
With this, item_id will never be NULL as it is a valid FOREIGN KEY to item table.
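A sketch of the two tables in PostgreSQL syntax (types are illustrative; the unique constraint keeps one row per item and version):

```sql
create table item (
    id integer primary key
);

create table item_changes (
    id            serial primary key,
    item_id       integer not null references item (id),
    version       integer not null,
    amount        numeric(12, 2) not null,
    change_reason text,
    unique (item_id, version)
);
```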
The best method is to use Version Normal Form (VNF). Here is an answer I gave describing a neat way to track all changes to specific fields of specific tables.
The static table contains the static data, such as PK and other attributes which do not change over the life of the entity or such changes need not be tracked.
The version table contains all dynamic attributes that need to be tracked. The best design uses a view which joins the static table with the current version from the version table, as the current version is probably what your apps need most often. Triggers on the view maintain the static/versioned design without the app needing to know anything about it.
The link above also contains a link to a document which goes into much more detail including queries to get the current version or to "look back" at any version you need.
Why not go for SCD-2 (Slowly Changing Dimension)? It is a rule/methodology that describes the best solution for your problem. Here are the advantages of SCD-2 and an example of its use; it is a standard design pattern for databases.
Type 2 - Creating a new additional record. In this methodology, all history of dimension changes is kept in the database. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key (or other durable identifiers). 'Effective date' and 'current indicator' columns are also used in this method. There can be only one record with the current indicator set to 'Y'. For the 'effective date' columns, i.e. start_date and end_date, the end_date for the current record is usually set to the value 9999-12-31. Introducing changes to the dimensional model in type 2 can be a very expensive database operation, so it is not recommended in dimensions where a new attribute could be added in the future.
id | amount | start_date  | end_date    | current_flag
---+--------+-------------+-------------+-------------
1  | 100    | 01-Apr-2018 | 02-Apr-2018 | N
2  | 80     | 04-Apr-2018 | NULL        | Y
Detailed explanation:
Here, all you need is to add the three extra columns START_DATE, END_DATE, and CURRENT_FLAG to track your records properly. When the record is first inserted at the source, the table will store the values as:
id | amount | start_date  | end_date | current_flag
---+--------+-------------+----------+-------------
1  | 100    | 01-Apr-2018 | NULL     | Y
And when the same record is updated, you update the END_DATE of the previous record to the current system date and its CURRENT_FLAG to 'N', and insert the second record as below. This way you can track everything about your records:
id | amount | start_date  | end_date    | current_flag
---+--------+-------------+-------------+-------------
1  | 100    | 01-Apr-2018 | 02-Apr-2018 | N
2  | 80     | 04-Apr-2018 | NULL        | Y
I’m trying to create a trigger such that whenever I insert a new record into the Sales table, the Product table updates its “Inventory” based on the Sales table’s “quantity”:
Product table        Sales table

P_ID | QTY           P_ID | QTY
-----+-----          -----+-----
1    | 10            1    | 5
2    | 15
Code:
create trigger "KABIL_PRACTICE"."SALES_TRIGGER"
after insert on "KABIL_PRACTICE"."SALES" REFERENCING NEW ROW AS newrow for each row
begin
update "KABIL_PRACTICE"."Inventory" set "Inventory" = "Inventory" - :newrow.QTY
where "P_ID" = :newrow.P_ID ;
end;
I get the expected result when I insert a record into the Sales table with P_ID 1 and quantity 5:
updated Product table        Sales table

P_ID | QTY                   P_ID | QTY
-----+-----                  -----+-----
1    | 5                     1    | 5
2    | 15                    1    | 5
But if I insert a record into the Sales table again with P_ID 1 and quantity 6, the sales quantity is more than the available inventory quantity, which means the inventory goes to a negative value...
updated Product table        Sales table

P_ID | QTY                   P_ID | QTY
-----+-----                  -----+-----
1    | -1                    1    | 5
2    | 15                    1    | 5
                             1    | 6
I just want to be notified when the sales order quantity is higher than the available inventory quantity, and the inventory should not go to negative values... Is there any way to do this?
I tried this code:
create trigger "KABIL_PRACTICE"."SALES_UPDATE_TRIGGER"
before insert on "KABIL_PRACTICE"."SALES" REFERENCING NEW ROW AS newrow for each row
begin
    declare available_qty integer;
    -- read the current inventory for the product being sold
    select "Inventory" into available_qty
        from "KABIL_PRACTICE"."Inventory"
        where "P_ID" = :newrow.P_ID;
    if :available_qty >= :newrow.QTY then
        update "KABIL_PRACTICE"."Inventory"
            set "Inventory" = "Inventory" - :newrow.QTY
            where "P_ID" = :newrow.P_ID;
    -- else: the inventory is left untouched, but the sales row
    -- is still inserted, which is not what I want
    end if;
end;
The problem you have here is a classic. Usually the two business processes "SALES" and "ORDER FULFILLMENT" are separated, so the act of selling something would not have an immediate effect on the stock level. Instead, the order fulfilment could actually use other resources (e.g. back ordering from another vendor or producing more). That way the sale would be de-coupled from the current stock levels.
Anyhow, if you want to keep it a simple dependency of "only-sell-whats-available-right-now" then you need to consider the following:
multiple sales could be going on at the same time
what to do with sales that can only be partly fulfilled, e.g. should all available items be sold or should the whole order be handled as not able to fulfil?
To address the first point, again, different approaches can be taken. The easiest probably is to set a lock on the inventory records you are interested in as long as you make the decision(s) whether to process the order (and inventory transaction) or not.
SELECT "QTY" FROM "KABIL_PRACTICE"."Inventory" WHERE "P_ID" = :P_ID FOR UPDATE;
This statement will acquire a lock on the relevant row(s) and return, or wait until the lock becomes available in case another session already holds it.
Once the quantity of an item is retrieved, you can call the further business logic (fulfil order completely, partly or decline).
Each of these application paths could be a stored procedure grouping the necessary steps.
By COMMITing the transaction the lock will get released.
As a general remark: this should not be implemented as triggers. Triggers should generally not be involved in application paths that could lead to locking situations in order to avoid system hang situations. Also, triggers don't really allow for a good understanding of the order in which statements get executed, which easily can lead to unwanted side effects.
Rather than triggers, stored procedures can provide an interface for applications to work with your data in a meaningful and safe way.
E.g.
procedure ProcessOrder
for each item in order
check stock and lock entry
(depending on business logic):
subtract all available items from stock to match order as much as possible
OR: only fulfil order items that can be fully provided and mark other items as not available. Reduce sales order SUM.
OR: decline the whole order.
COMMIT;
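A rough sketch of such a procedure, written in PostgreSQL's PL/pgSQL for illustration (the thread itself uses SAP HANA, whose SQLScript syntax differs; table and column names follow the example):

```sql
create or replace function process_order(p_product_id integer, p_qty integer)
returns text
language plpgsql
as $$
declare
    available integer;
begin
    -- lock the inventory row until this transaction commits
    select qty into available
    from product
    where p_id = p_product_id
    for update;

    if available >= p_qty then
        update product set qty = qty - p_qty where p_id = p_product_id;
        insert into sales (p_id, qty) values (p_product_id, p_qty);
        return 'fulfilled';
    else
        -- alternatively: partial fulfilment, per the business logic above
        return 'declined';
    end if;
end;
$$;
```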
Your application can then simply call the procedure and retrieve the outcome data (e.g. current order data, status, ...) without having to worry about how the tables need to be updated.