Merge multiple tables having different columns - PostgreSQL

I have 4 tables, each with a different number of columns, as listed below:
tableA - 34
tableB - 47
tableC - 26
tableD - 16
Every table has a common column called id. Now I need to perform a union, but since the column lists have different lengths and entirely different names, I can't do a union directly.
Based on id alone I can get the details from every table, so how should I approach this?
What is the optimal way to solve this? I tried a full join, but that takes too much time.
Tried so far
SELECT * FROM tableA
FULL JOIN tableB USING (id)
FULL JOIN tableC USING (id)
FULL JOIN tableD USING (id)
WHERE tableA.id = 123 OR
tableB.id = 123 OR
tableC.id = 123 OR
tableD.id = 123

Snowflake does have a declared limitation on the use of set operators (such as UNION):
When using these operators:
Make sure that each query selects the same number of columns.
[...]
However, since the column names are well known, it is possible to come up with a superset of all unique column names required in the final result and project them explicitly from each query.
There's not enough information in the question on how many columns overlap (47 unique columns?), or whether they are all different apart from id (1 + 33 + 46 + 25 + 15 = 120 unique columns?). The answer to this determines the amount of effort required to write out each query, as it involves adapting a query of the following form:
SELECT * FROM t1
Into an explicit form with dummy columns, defined with acceptable defaults that match the data types those columns have in the tables where they are present:
SELECT
present_col1,
NULL AS absent_col2,
0.0 AS absent_col3,
present_col4,
[...]
FROM
t1
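For instance, a minimal sketch of the resulting union for just tableA and tableB, where all column names are hypothetical placeholders (the real superset would run to roughly 120 columns):
SELECT id, a_col1, a_col2, NULL AS b_col1, NULL AS b_col2 FROM tableA WHERE id = 123
UNION ALL
SELECT id, NULL, NULL, b_col1, b_col2 FROM tableB WHERE id = 123
-- ...and likewise for tableC and tableD, always projecting the same
-- superset of columns in the same order
Filtering on id inside each branch also avoids the slow OR-filtered full join from the question.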
You can also use some metaprogramming with stored procedures to "generate" such an altered query by inspecting each query's result column names using the Statement::getColumnCount(), Statement::getColumnName(), etc. APIs and forming a superset union version with default/empty values.
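As a simpler alternative to stored-procedure metaprogramming, the column superset can also be read straight out of the catalog. A sketch using standard information_schema, which exists in both Snowflake and PostgreSQL (adjust the identifier case to how the tables are actually stored in the catalog):
SELECT column_name,
       count(*) AS appears_in_n_tables
FROM information_schema.columns
WHERE table_name IN ('tableA', 'tableB', 'tableC', 'tableD')
GROUP BY column_name
ORDER BY column_name;
Columns with appears_in_n_tables = 4 (like id) need no dummy defaults; the rest must be padded with NULLs in the queries where they are absent.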

Best usage of indexes and primary key on joined and filtered data in PostgreSQL

I have 2 tables with the exact same number of rows and the same non-repeated id. Because the data comes from 2 sources, I want to keep them as 2 tables rather than combining them. I assume the best approach would be to leave the unique id as the primary key and join on it?
SELECT * FROM tableA INNER JOIN tableB ON tableA.id = tableB.id
The data is used by an application that forces the user to select 1 or many values from 5 drop-downs in cascading order:
select 1 or many values from tableA column1.
select 1 or many values from tableA column2, filtered by the first selection.
select 1 or many values from tableA column3, filtered by the second selection, which in turn is filtered by the first.
For example:
pk   Column 1  Column 2  Column 3
123  Doe       Jane      2022-01
234  Doe       Jane      2021-12
345  Doe       John      2022-03
456  Jones     Mary      2022-04
Selecting "Doe" from column1 would limit the second filter to ("Jane","John"). And selecting "Jane" from column2 would filter column3 to ("2022-01","2021-12")
And the last part of the question:
The application has 3 selection options for column3:
picking the exact value (for example "2022-01"), picking the year ("2022"), or picking the quarter that the month falls into ("Q1", which equates to months "01", "02", "03").
What would be the best usage of indexes AND/OR additional columns for this scenario?
Volume of data would be 20-100 million rows.
Each filter is in the range of 5-25 distinct values.
Which version of Postgres are you running?
The volume you state is rather daunting for such a use case: populating drop-down boxes from live data in a PG database.
No kidding, it's possible; Kibana/Elastic, for instance, has a filter widget that works exactly this way.
My guess is you may want to consider storing the distinct combinations of the search columns in another table, simply to speed up populating the drop-downs. You can maintain that with triggers on the two main tables. So instead of additional columns/indexes you may end up with an additional table ;)
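A minimal sketch of that idea, assuming the filter columns live on tableA and using a periodically refreshed materialized view as a simpler stand-in for the trigger-maintained table:
CREATE MATERIALIZED VIEW filter_options AS
SELECT DISTINCT column1, column2, column3
FROM tableA;
-- at 5-25 distinct values per filter this tops out around 25^3 = 15,625 rows,
-- versus 20-100 million in tableA; refresh on whatever schedule fits
REFRESH MATERIALIZED VIEW filter_options;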
Regarding indexing strategy, and given the hints you stated (AND/OR), I'd say there's no silver bullet. Index the columns that will be queried most often.
Index each column individually, because Postgres can combine multiple indexes (via bitmap index scans) to answer conjunctive/disjunctive conditions in WHERE clauses.
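For example (hypothetical index names on the sample columns above):
CREATE INDEX idx_tablea_col1 ON tableA (column1);
CREATE INDEX idx_tablea_col2 ON tableA (column2);
CREATE INDEX idx_tablea_col3 ON tableA (column3);
-- Postgres can bitmap-AND/OR these for predicates such as:
-- WHERE column1 = 'Doe' AND column3 BETWEEN '2022-01' AND '2022-03'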
Hope this helps

How to handle NULL in visualization

I have two tables (an example, of course) that I loaded into the app from different sources via script.
Table 1:
ID  Attribute T1
1   100
3   200
Table 2:
ID  Attribute T2
1   Value 1
2   Value 2
On a list I create a table with the columns ID, Attribute T1, and Attribute T2.
Finally I have this table:
ID  Attribute T1  Attribute T2
1   100           Value 1
2   -             Value 2
3   200           -
So, as you know, this limits me in filtering and analyzing the data; for example, I can't show all data that isn't represented in Table 1, or all data where Attribute T1 is not equal to 100.
I tried to use NullAsValue, but it didn't help. I would appreciate any ideas on how to manage my case.
To achieve what you're attempting, you'll need to Join or Concatenate your tables. The reason is because Null means something different depending on how the data is loaded.
There are basically two "types" of Null:
"Implied" Null
When you associate several tables in your data model, as you've done in your example, Qlik is essentially treating that as a natural outer join between the tables. But since it's not an actual join that happens when the script executes, the Nulls that arise from data incongruencies (like in your example) are basically implied, since there really is an absence of data there. There's nothing in the data or script that actually says "there are no Attribute T1 values for ID of 2." Because of that, you can't use a function like NullAsValue() or Coalesce() to replace Nulls with another value because those Nulls aren't even there -- there's nothing to actually replace.
The above tables don't have any actual Nulls -- just implied ones from their association and the fact that the ID fields in either table don't have all the same values.
"Realized" Null
If, instead of just using associations, you actually combine the tables using the Join or Concatenate prefixes, then Qlik is forced to actually generate a Null value in the absence of data. Instead of Null being implied, it's actually there in the data model -- it's been realized. In this case, we can actually use functions like NullAsValue() or Coalesce() or Alt() to replace Nulls with another value since we actually have something in our table to replace.
The above joined table has actual Nulls that are realized in the data model, so they can be replaced.
To replace Nulls at that point, you can use the NullAsValue() or Coalesce() functions like this in the Data Load Editor:
table1:
load * inline [
ID , Attribute T1
1 , 100
3 , 200
];
// the join prefix merges table2 into table1, which realizes Nulls
// wherever the ID values don't line up
table2:
join load * inline [
ID , Attribute T2
1 , Value 1
2 , Value 2
];
// NullAsValue converts Nulls in the listed fields to the value of the
// NullValue variable on subsequent loads
NullAsValue [Attribute T1];
Set NullValue = '-NULL-';
new_table:
NoConcatenate load
ID
, [Attribute T1]
, Coalesce([Attribute T2], '-AlsoNULL-') as [Attribute T2]
Resident table1;
Drop Table table1;
That will result in a table like this:
ID  Attribute T1  Attribute T2
1   100           Value 1
2   -NULL-        Value 2
3   200           -AlsoNULL-
The Coalesce() and Alt() functions are also available in chart expressions.
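For instance, a hypothetical measure using the field from the example above:
=Coalesce([Attribute T2], 'missing')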
Here are some quick links to the things discussed here:
Qlik Null interpretation
Qlik table associations
NullAsValue() function
Coalesce() function
Alt() function

Loop a result set and feed two tables

I have a select query that returns a huge result set (500k records). But for this example let's say it has only two records:
SELECT * FROM INVENTORY I
INNER JOIN PARTS P
ON I.partcode = P.partcode
ORDER BY I.partcode
The result will look more or less like this:
pk  partcode  genericname  partname  stock
1   001       mouse        logitech  10
2   002       keyboard     genius    8
I have to loop the result above and feed two tables (product and variant).
I first have to insert two of the columns into 'product' table, like this:
INSERT INTO PRODUCT
(p_code,product_name) values (partcode,genericname)
pk  p_code  product_name
5   001     mouse
6   002     keyboard
Then I have to grab the pk that was automatically generated in the table above (say ppk) and insert it together with the other two columns into the 'variant' table, like this:
INSERT INTO VARIANT
(product_pk,variant_name,in_stock) values (ppk,partname,stock)
pk  product_pk  variant_name  in_stock
10  5           logitech      10
11  6           genius        8
At the end I should have the product and the variant tables with 2 records each.
I could write VB code to do that, but I think it can be done in pure SQL; I'm just not sure of the best approach.
Could someone give me some help with this?
Thank you!
You could use a SQL cursor to loop through and insert a row at a time into PRODUCT and then use SCOPE_IDENTITY() to get the newly assigned identity value to insert a corresponding row into VARIANT, but best practice is to avoid cursors if there's another way. (There usually is, but not always.)
If the partcode/genericname combination will uniquely identify 1 record in PRODUCT, you could do this:
INSERT INTO PRODUCT (p_code,product_name)
SELECT partcode, genericname
FROM INVENTORY I INNER JOIN PARTS P ON I.partcode = P.partcode
(I would eliminate the ORDER BY from your query unless you care about the order in which the identity values are assigned.)
Then, run this:
INSERT INTO VARIANT
(product_pk,variant_name,in_stock)
SELECT pr.ppk, i.partname, i.stock
FROM inventory i INNER JOIN parts p ON i.partcode = p.partcode
INNER JOIN product pr on i.partcode = pr.p_code and i.genericname = pr.product_name
You may have to clean up the aliases between i and p in the 2nd query. I can't tell which table (inventory or parts) the variant_name and in_stock fields are coming from so I just used i.
Again - this assumes that partcode/genericname combination is unique in the PRODUCT table.
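If you're on SQL Server (which the SCOPE_IDENTITY() mention suggests), another set-based option is MERGE with an OUTPUT clause; unlike INSERT ... OUTPUT, it can capture source columns alongside the generated identity. A sketch, assuming PRODUCT's identity column is named pk as in the sample output, and (as above) guessing that genericname, partname, and stock come from INVENTORY:
DECLARE @map TABLE (ppk INT, partname VARCHAR(100), stock INT);
MERGE INTO PRODUCT AS tgt
USING (
    SELECT I.partcode, I.genericname, I.partname, I.stock
    FROM INVENTORY I
    INNER JOIN PARTS P ON I.partcode = P.partcode
) AS src
ON 1 = 0  -- never matches, so every source row is inserted
WHEN NOT MATCHED THEN
    INSERT (p_code, product_name) VALUES (src.partcode, src.genericname)
OUTPUT inserted.pk, src.partname, src.stock INTO @map;
INSERT INTO VARIANT (product_pk, variant_name, in_stock)
SELECT ppk, partname, stock FROM @map;
This avoids both the cursor and the assumption that partcode/genericname is unique.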

LEFT JOIN returns incorrect result in PostgreSQL

I have two tables: A (525,968 records) and B (517,831 records). I want to generate a table with all the rows from A and the matched records from B. Both tables have an "id" column and a "year" column. The combination of id and year is unique in table A, but not in table B. So I wrote the following query:
SELECT
A.id,
A.year,
A.v1,
B.x1,
B.e1
FROM
A
LEFT JOIN B ON (A.id = B.id AND A.year = B.year);
I thought the result should contain the same total number of records as A, but it only returns about 517,950 records. I'm wondering what the possible cause may be.
Thanks!
First of all, I understand that this is an example, but Postgres may have issues with capital letters in table names (unquoted identifiers are folded to lower case).
Secondly, it may be a good idea to check how exactly you arrived at 525,968 records. If you used some kind of database administration / query client, it may have shown you technical information about the table rather than an exact count (Postgres keeps internal row estimates, such as pg_class.reltuples, that can differ from the actual number of records).
And finally, to check yourself, do something like:
SELECT
count("A".id)
FROM
"A"

select distinct from 2 columns but only 1 is duplicate

select a.subscriber_msisdn, war.created_datetime from
(
select distinct subscriber_msisdn from wiz_application_response
where application_item_id in
(select id from wiz_application_item where application_id=155)
and created_datetime between '2012-10-07 00:00' and '2012-11-15 00:00:54'
) a
left outer join wiz_application_response war on (war.subscriber_msisdn=a.subscriber_msisdn)
The subselect returns 11 rows but when joined it returns 18 (with duplicates). The objective of this query is only to add the date column to the 11 rows of the subselect.
Based on your description, it stands to reason that there are multiple created_datetime values for some of the subscriber_msisdn values, which is what prompted you to use the distinct in the subquery to begin with. By joining the subquery back to the original table you are defeating this. A cleaner way to write the query would be:
SELECT
war.subscriber_msisdn
, war.created_datetime
FROM
wiz_application_response war
LEFT JOIN wiz_application_item wai
ON war.application_item_id = wai.id
AND wai.application_id = 155
WHERE
war.created_datetime BETWEEN '2012-10-07 00:00' AND '2012-11-15 00:00:54'
This should return only the rows from the war table that satisfy the criteria based on the wai table. It should not be an outer join unless you want to return all the rows from the war table that satisfy the created_datetime parameter regardless of the application_item_id parameter.
This is my best guess based on the limited information I have about your tables and what I’m assuming you’re trying to accomplish. If this doesn’t get you what you are after, I will continue to offer other ideas based on additional information you could provide. Hope this works.
It can most probably be simplified to this:
SELECT DISTINCT ON (1)
r.subscriber_msisdn, r.created_datetime
FROM wiz_application_item i
JOIN wiz_application_response r ON r.application_item_id = i.id
WHERE i.application_id = 155
AND r.created_datetime BETWEEN '2012-10-07 00:00' AND '2012-11-15 00:00:54'
ORDER BY 1, 2 DESC -- to pick the latest created_datetime
Details depend on missing information.