PostgreSQL: More efficient way of joining tables based on multiple address fields

I have a table that lists two connected values, ID and Tax Number (TIN), that looks somewhat like this:
IDTINMap
ID         | TIN
-----------|-------
1234567890 | 654321
3456321467 | 986321
8764932312 | 245234
An ID can map to multiple TINs, and a TIN might map to multiple IDs, but there is a UNIQUE constraint on the table for the (ID, TIN) pair.
This list isn't complete, and the table has about 8000 rows. I have another table, IDListing that contains metadata for about 9 million IDs including name, address, city, state, postalcode, and the ID.
What I'm trying to do is build an expanded ID-TIN map. Currently I do this by first joining the IDTINMap table with IDListing on the ID field, which gives something like the following in a CTE that I'll call Step1 for now:
ID         | TIN    | Name      | Address        | City          | State | Zip
-----------|--------|-----------|----------------|---------------|-------|------
1234567890 | 654321 | John Doe  | 123 Easy St    | Seattle       | WA    | 65432
3456321467 | 986321 | Tyler Toe | 874 W 84th Ave | New York      | NY    | 48392
8764932312 | 245234 | Jane Poe  | 984 Oak Street | San Francisco | CA    | 12345
Then I join the IDListing table a second time, matching Step1 on name, address, city, state, and zip all being equal. I know I could do something more complicated like fuzzy matching, but for right now we're only looking at exact matches. In the join I preserve the ID from Step1 as 'ReferenceID', keep the TIN, and add another column with all the matching IDs. I don't keep any of the address/city/state/zip info, just the three numbers.
Then I can go back and insert all the distinct pairs into a final table.
I've tried this with a query, and it works and gives the desired result. However, the query is slower than desired. I'm used to joining on columns that I've indexed (like ID or TIN), but joining on all of the address fields is slow. Is there a good way to improve this? Joining on each field individually is faster than joining on a CONCAT() of all the fields (I have tried this). I'm just wondering if there is another way to optimize it.

Make the final result a materialized view. Refresh it when you need to update the data (every night? every three hours?). Then use this view for your normal operations.
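A minimal sketch of that approach, using the table and column names from the question (the multicolumn index and exact column list are assumptions about your schema; adjust as needed):

```sql
-- Hypothetical sketch: a multicolumn index covering the join columns,
-- plus a materialized view holding the expanded ID-TIN map.
CREATE INDEX idx_idlisting_addr
    ON idlisting (name, address, city, state, postalcode);

CREATE MATERIALIZED VIEW expanded_id_tin_map AS
WITH step1 AS (
    SELECT m.id, m.tin, l.name, l.address, l.city, l.state, l.postalcode
    FROM idtinmap m
    JOIN idlisting l ON l.id = m.id
)
SELECT DISTINCT s.id AS reference_id, s.tin, l.id AS matched_id
FROM step1 s
JOIN idlisting l
  ON  l.name       = s.name
  AND l.address    = s.address
  AND l.city       = s.city
  AND l.state      = s.state
  AND l.postalcode = s.postalcode;

-- Refresh on whatever schedule fits your data (nightly, every few hours, ...):
REFRESH MATERIALIZED VIEW expanded_id_tin_map;
```

The multicolumn index lets the planner use an index scan (or at least a cheaper merge/hash join) for the five-way equality match instead of comparing every address field row by row.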

Related

Compare columns in two tables and merge values in PostgreSQL

I am given two tables that share some column names and have somewhat similar rows. I need a method to compare the row entries and, for the matching rows, insert the value from one table into a column of the other. An example that describes my problem looks like the following:
big table           | small table
--------------------|---------------------------
customer | address  | customer       | address
---------|----------|----------------|----------
John     | 123 Road | customer John  | 123 Road
Jason    | 234 Dr   | shopper Jack   | 645 Pkway
Jack     | NULL     | customer Jones | 789 Road
Jones    | NULL     |
The small table can be viewed as a data source for the big table; they contain the same column names. I have tried join methods, but they don't quite fit: the different naming conventions cause problems. In this case, the customer names in the big table are contained within the customer names in the small table. What I would like to achieve in the above example is that the addresses of shopper Jack and customer Jones in the small table are inserted into the address column for Jack and Jones in the big table.
I hope my description is clear enough. Thank you.
Use UPDATE ... FROM, with position() testing whether the big table's customer name occurs inside the small table's customer name:
update big_table bt
set address = sm.address
from small_table sm
where bt.address is null
and position(bt.customer in sm.customer) > 0;
But be extremely cautious: this is not good practice. It would be much better to split the small table's customer column into two columns. You also need to decide what happens when both tables have an address for a customer but they differ, and what to do about multiple customers with the same name.
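A sketch of that cleaner design (the new column names are assumptions, and this assumes the customer field is always exactly "type name" with a single space):

```sql
-- Hypothetical cleanup: split 'customer John' into type and name columns.
ALTER TABLE small_table
    ADD COLUMN customer_type text,
    ADD COLUMN customer_name text;

UPDATE small_table
SET customer_type = split_part(customer, ' ', 1),
    customer_name = split_part(customer, ' ', 2);

-- The backfill then becomes an exact-match join instead of a substring test:
UPDATE big_table bt
SET address = sm.address
FROM small_table sm
WHERE bt.address IS NULL
  AND bt.customer = sm.customer_name;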

Versioning in the database

I want to store full versioning of the row every time an update is made to an amount-sensitive table.
So far, I have decided to use the following approach:
Do not allow updates.
Every time an update is made, create a new entry in the table.
However, I am undecided on what is the best database structure design for this change.
Current Structure
Primary Key: id
id(int) | amount(decimal) | other_columns
First Approach
Composite Primary Key: id, version
id(int) | version(int) | amount(decimal) | change_reason
1 | 1 | 100 |
1 | 2 | 20 | correction
Second Approach
Primary Key: id
Uniqueness Index on [origin_id, version]
id(int) | origin_id(int) | version(int) | amount(decimal) | change_reason
1 | NULL | 1 | 100 | NULL
2 | 1 | 2 | 20 | correction
I would suggest a new table which stores a unique id for each item. This serves as a lookup table for all available items.
item Table:
id(int)
1000
For the table which stores all changes for an item, let's call it the item_changes table, item_id is a FOREIGN KEY to the item table's id. The relationship between the item table and the item_changes table is one-to-many.
item_changes Table:
id(int) | item_id(int) | version(int) | amount(decimal) | change_reason
1 | 1000 | 1 | 100 | NULL
2 | 1000 | 2 | 20 | correction
With this, item_id will never be NULL as it is a valid FOREIGN KEY to item table.
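A DDL sketch of that layout (column types and the UNIQUE constraint are assumptions beyond what the answer shows):

```sql
CREATE TABLE item (
    id int PRIMARY KEY
);

CREATE TABLE item_changes (
    id            int PRIMARY KEY,
    item_id       int NOT NULL REFERENCES item (id),
    version       int NOT NULL,
    amount        decimal NOT NULL,
    change_reason text,
    UNIQUE (item_id, version)  -- one row per version of an item
);
```

Inserting a change is then an INSERT into item_changes with the next version number for that item_id, never an UPDATE.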
The best method is to use Version Normal Form (VNF). Here is an answer I gave for a neat way to track all changes to specific fields of specific tables.
The static table contains the static data, such as PK and other attributes which do not change over the life of the entity or such changes need not be tracked.
The version table contains all dynamic attributes that need to be tracked. The best design uses a view which joins the static table with the current version from the version table, as the current version is probably what your apps need most often. Triggers on the view maintain the static/versioned design without the app needing to know anything about it.
The link above also contains a link to a document which goes into much more detail including queries to get the current version or to "look back" at any version you need.
Why not go with SCD-2 (Slowly Changing Dimension, Type 2)? It is a rule/methodology that describes a good solution for your problem. Here are the advantages of SCD-2 and an example of its use; it is a standard design pattern for databases.
Type 2 - Creating a new additional record. In this methodology, all history of dimension changes is kept in the database. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key (or other durable identifiers). 'Effective date' and 'current indicator' columns are also used in this method. There can be only one record with the current indicator set to 'Y'. For the 'effective date' columns, i.e. start_date and end_date, the end_date for the current record is usually set to the value 9999-12-31. Introducing changes to the dimensional model in Type 2 can be a very expensive database operation, so it is not recommended in dimensions where a new attribute could be added in the future.
id | amount | start_date  | end_date    | current_flag
1  | 100    | 01-Apr-2018 | 02-Apr-2018 | N
2  | 80     | 04-Apr-2018 | NULL        | Y
Detailed explanation:
Here, all you need to do is add the three extra columns, START_DATE, END_DATE, and CURRENT_FLAG, to track your records properly. When a record is first inserted at the source, the table stores the value as:
id | amount | start_date  | end_date | current_flag
1  | 100    | 01-Apr-2018 | NULL     | Y
And when the same record is updated, you set the END_DATE of the previous record to the current system date and its CURRENT_FLAG to 'N', then insert the new record as below. That way you can track everything about your records:
id | amount | start_date  | end_date    | current_flag
1  | 100    | 01-Apr-2018 | 02-Apr-2018 | N
2  | 80     | 04-Apr-2018 | NULL        | Y

Is it possible to use different forms and create one row of information in a table?

I have been searching for a way to combine two or more rows of one table in a database into one row.
I am currently creating multiple web-based forms that connect to one table in my database. Is there any way to write some mysql and php code that will take separate form submissions and put them into one row of the database instead of multiple rows?
Here is an example of what is going into the database:
This is all in one table with three rows.
Form_ID represents the three different forms that I used to insert the data into the table.
Form_ID | Lot_ID | F_Name | L_Name | Date       | Age
--------|--------|--------|--------|------------|------
1       | 1      | John   | Evans  | NULL       | NULL
2       | NULL   | NULL   | NULL   | 2017-07-06 | NULL
3       | NULL   | NULL   | NULL   | NULL       | 22
This is an example of three separate forms going into one table. Every time the submit button is hit the data just inserts down to the next row of information.
I need some sort of join or update once the submit button is hit to replace the preceding NULL values.
Here is what I want to do after the submit button is hit:
I want it all combined into one row, but still in one table.
Form_ID still represents the three separate forms, only now in a single row.
Form_ID | Lot_ID | F_Name | L_Name | Date       | Age
--------|--------|--------|--------|------------|-----
1       | 1      | John   | Evans  | 2017-07-06 | 22
My goal is once a one form has been submitted I want the next, different form submission to replace the NULL values in the row above it and so on to create a single row of information.
I found a way to solve this issue. I used UPDATE tablename SET columnname = newValue WHERE Form_ID = matchingID.
This way, when I want to update rows that have blank spaces, it finds the matching IDs.
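A sketch of that fix in SQL (the table name and the idea that later forms know which Form_ID to target are assumptions based on the example):

```sql
-- Hypothetical: when the second form posts the Date, fill it into the
-- existing row for the same record instead of inserting a new row.
UPDATE form_data
SET `Date` = '2017-07-06'
WHERE Form_ID = 1;
```

Each subsequent form submission then issues an UPDATE against the shared row rather than an INSERT, so the NULL columns get filled in one by one.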

Using filtered results as field for calculated field in Tableau

I have a table that looks like this:
+------------+-----------+---------------+
| Invoice_ID | Charge_ID | Charge_Amount |
+------------+-----------+---------------+
| 1 | A | $10 |
| 1 | B | $20 |
| 2 | A | $10 |
| 2 | B | $20 |
| 2 | C | $30 |
| 3 | C | $30 |
| 3 | D | $40 |
+------------+-----------+---------------+
In Tableau, how can I have a field that SUMs the Charge_Amount for the Charge_IDs B, C and D, where the invoice has a Charge_ID of A? The result would be $70.
My datasource is SQL Server, so I was thinking that I could add a field (called Has_ChargeID_A) to the SQL Server Table that tells if the invoice has a Charge_ID of A, and then in Tableau just do a SUM of all the rows where Has_ChargeID_A is true and Charge_ID is either B, C or D. But I would prefer if I can do this directly in Tableau (not this exactly, but anything that will get me to the same result).
Your intuition is steering you in the right direction. You do want to filter to only Invoices that contain row with a Charge_ID of A, and you can do this directly in Tableau.
First place Invoice_ID on the filter shelf, then select the Condition tab for the filter. Then select the "By formula" option on the condition tab and enter the formula you wish to use to determine which invoice_ids are included by the filter.
Here is a formula for your example:
count(if Charge_ID = 'A' then 'Y' end) > 0
For each data row, it will calculate the value of the expression inside the parentheses, and then only include invoice_ids with at least one non-null value for the inner expression. (The implicit else of the if statement "returns" null.)
The condition tab for a dimension field equates to a HAVING clause in SQL.
If condition formulas get complex, it's often a good idea to define them with a calculated field -- or a combination of several simpler calculated fields -- just to keep things manageable.
Finally, if you end up working with sets of dimensions like this frequently, you can define them as sets. You can still drop sets on the filter shelf, but then can reuse them in other ways: like testing set membership in a calculated field (like a SQL IN clause), or by creating new sets using intersection and union operators. You can think of sets like named filters, such as the set of invoices that contain type A charge.

Derived Associations with Entity Framework

I've just started with Entity Framework this week, and am struggling with a few of the concepts.
Right now, I have a database structure that I am struggling to transfer across to entity framework.
I have started with the model first, and have this:
------------------      -----------------------
| Order_Item     |      | Order_FetchableItem |
------------------      -----------------------
| order_id       |      | order_id            |
| item_id        |      | item_id             |
------------------      | fetch_url           |
                        -----------------------
The idea is that orders contain items, and this relation is conveyed in the order_item table. HOWEVER, some (not all) of the items in an order have a URL, so this property needs to be stored too.
I can't get this working in EF, because EF detects Order_Item as a relation, and I can't derive from it. What's the best alternative for doing this?
I have considered moving the fetch_url field to the Order_Item table, but as it is a wide column, I don't want lots of NULL values in the order_item table.
Thanks, and please excuse the formatting above!