Best usage of indexes and primary key on joined and filtered data in PostgreSQL - postgresql

I have 2 tables with the exact same number of rows and the same non-repeated id. Because the data comes from 2 sources, I want to keep it in 2 tables rather than combine it. I assume the best approach would be to leave the unique id as the primary key and join on it?
SELECT * FROM tableA INNER JOIN tableB ON tableA.id = tableB.id  -- joining on the shared primary key
The data is used by an application that forces the user to select 1 or many values from 5 drop-downs in cascading order:
Select 1 or many values from tableA column1.
Select 1 or many values from tableA column2, filtered by the first selection.
Select 1 or many values from tableA column3, filtered by the second selection, which in turn is filtered by the first.
For example:
pk   Column 1  Column 2  Column 3
123  Doe       Jane      2022-01
234  Doe       Jane      2021-12
345  Doe       John      2022-03
456  Jones     Mary      2022-04
Selecting "Doe" from column1 would limit the second filter to ("Jane","John"). And selecting "Jane" from column2 would filter column3 to ("2022-01","2021-12")
And the last part of the question:
The application has 3 selection options for column3:
picking the exact value (for example "2022-01"), picking the year ("2022"), or picking the quarter the month falls into ("Q1", which equates to months "01", "02", and "03").
What would be the best usage of indexes AND/OR additional columns for this scenario?
Volume of data would be 20-100 million rows.
Each filter is in the range of 5-25 distinct values.

Which version of Postgres are you running?
The volume you state is rather daunting for such a use case: populating drop-down boxes from live data in a PG database.
No kidding, it's possible; Kibana/Elastic even has a filter widget that works exactly this way, for instance.
My guess is you may want to consider storing the distinct combinations of the search columns in another table, simply to speed up populating the drop-downs. You can achieve that with triggers on the two main tables. So instead of additional columns/indexes you may end up with an additional table ;)
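A minimal sketch of that idea, assuming the three filter columns live in tableA and using hypothetical names; deletes and updates on the main table would need similar handling:
CREATE TABLE filter_combinations (
    column1 text NOT NULL,
    column2 text NOT NULL,
    column3 text NOT NULL,
    PRIMARY KEY (column1, column2, column3)
);

CREATE OR REPLACE FUNCTION record_combination() RETURNS trigger AS $$
BEGIN
    -- keep only the distinct combinations; duplicates are ignored
    INSERT INTO filter_combinations (column1, column2, column3)
    VALUES (NEW.column1, NEW.column2, NEW.column3)
    ON CONFLICT DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tablea_filter_combinations
AFTER INSERT ON tableA
FOR EACH ROW EXECUTE FUNCTION record_combination();  -- EXECUTE PROCEDURE on Postgres < 11
The drop-downs are then populated from filter_combinations, which only holds the distinct combinations instead of scanning the 20-100 million row table.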
Regarding indexing strategy, and given the hints you stated (AND/OR), I'd say there's no silver bullet: index the columns that will be queried most often.
Index each column individually; Postgres can combine multiple single-column indexes (using bitmap index scans) to answer conjunctive/disjunctive conditions in WHERE clauses.
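A minimal sketch of that, assuming column3 is stored as text in 'YYYY-MM' form as in the example; the expression indexes only help if the application filters with exactly the same expressions:
CREATE INDEX ON tableA (column1);
CREATE INDEX ON tableA (column2);
CREATE INDEX ON tableA (column3);            -- exact-month picks: column3 = '2022-01'

-- For the "year" and "quarter" pick modes, index the derived expressions
-- (or add real columns holding them) and filter on those same expressions:
CREATE INDEX ON tableA (left(column3, 4));   -- year picks: left(column3, 4) = '2022'
CREATE INDEX ON tableA (right(column3, 2));  -- quarter picks: right(column3, 2) IN ('01', '02', '03')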
Hope this helps

Related

Convert certain columns that are recurring to rows in postgres

I have a table in Postgres that has ~50 million rows. I need to convert certain columns to rows.
I need to unpivot the per-individual columns so that each individual becomes its own row, repeating the non-individual variables against the respective ID.
The following is the output I need -
I'd appreciate any help on this.
50 million rows is not a big deal in Greenplum, but returning that many rows to a client is kind of pointless. I'm guessing you want to create a new table for this output. You are also going to be creating a table that is 2x larger, because you are taking a single row and turning it into 2.
create table new_table as
select id, mid_1 as mid, name_1 as name, age_1 as age, location
from your_table
union all
select id, mid_2 as mid, name_2 as name, age_2 as age, location
from your_table
distributed by (id);
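On plain PostgreSQL (no DISTRIBUTED BY), the same unpivot can be written as a single scan with a LATERAL VALUES join; a sketch using the same hypothetical column names as above:
create table new_table as
select t.id, v.mid, v.name, v.age, t.location
from your_table t
cross join lateral (values
    (t.mid_1, t.name_1, t.age_1),
    (t.mid_2, t.name_2, t.age_2)
) as v(mid, name, age);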

Loop a result set and feed two tables

I have a select query that returns a huge result set (500k records). But for this example let's say it has only two records:
SELECT * FROM INVENTORY I
INNER JOIN PARTS P
ON I.partcode = P.partcode
ORDER BY I.partcode
The result will look more or less like this:
pk partcode genericname partname stock
1 001 mouse logitech 10
2 002 keyboard genius 8
I have to loop the result above and feed two tables (product and variant).
I first have to insert two of the columns into 'product' table, like this:
INSERT INTO PRODUCT
(p_code,product_name) values (partcode,genericname)
pk p_code product_name
5 001 mouse
6 002 keyboard
Then I have to grab the pk that was automatically generated into the table above (say ppk) and then insert it together with the other two columns into the 'variant' table, like this:
INSERT INTO VARIANT
(product_pk,variant_name,in_stock) values (ppk,partname,stock)
pk product_pk variant_name in_stock
10 5 logitech 10
11 6 genius 8
At the end I should have the product and the variant tables with 2 records each.
I could write VB code to do that, but I think it can be done in pure SQL; I'm just not sure of the best approach.
Could someone give me some help with this?
Thank you!
You could use a SQL cursor to loop through and insert a row at a time into PRODUCT and then use SCOPE_IDENTITY() to get the newly assigned identity value to insert a corresponding row into VARIANT, but best practice is to avoid cursors if there's another way. (There usually is, but not always.)
If the partcode/genericname combination will uniquely identify 1 record in PRODUCT, you could do this:
INSERT INTO PRODUCT (p_code,product_name)
SELECT partcode, genericname
FROM INVENTORY I INNER JOIN PARTS P ON I.partcode = P.partcode
(I would eliminate the ORDER BY from your query unless you care about the order the identity values are assigned.)
Then, run this:
INSERT INTO VARIANT
(product_pk,variant_name,in_stock)
SELECT pr.ppk, i.partname, i.stock
FROM inventory i INNER JOIN parts p ON i.partcode = p.partcode
INNER JOIN product pr on i.partcode = pr.p_code and i.genericname = pr.product_name
You may have to clean up the aliases between i and p in the 2nd query. I can't tell which table (inventory or parts) the variant_name and in_stock fields are coming from so I just used i.
Again - this assumes that partcode/genericname combination is unique in the PRODUCT table.
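If this happens to be PostgreSQL, a data-modifying CTE with RETURNING can feed both tables in one statement and skip the second join back to PRODUCT; a sketch keeping the same assumptions (partcode/genericname uniquely identify a product, and the ambiguous columns come from the inventory alias):
WITH new_products AS (
    INSERT INTO product (p_code, product_name)
    SELECT i.partcode, i.genericname
    FROM inventory i
    INNER JOIN parts p ON i.partcode = p.partcode
    RETURNING pk, p_code, product_name   -- pk is the auto-generated key
)
INSERT INTO variant (product_pk, variant_name, in_stock)
SELECT np.pk, i.partname, i.stock
FROM inventory i
INNER JOIN new_products np
    ON np.p_code = i.partcode AND np.product_name = i.genericname;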

How to sort table alphabetically by name initial?

I have a table that contains the columns 'employeename' and 'id'. How can I sort the 'employeename' column in alphabetical order by the initial letter of the name?
Say the table is like this now:
employeename rid eid
Dave 1 1
Ben 4 2
Chloe 6 6
I tried the ORDER BY command; it shows what I want, but when I query the data again with SELECT, the displayed table data is the same as the original, indicating ORDER BY does not modify the data. Is this correct?
SELECT *
FROM employee
ORDER BY employeename ASC;
I expect the table data to be modified (sorted by names alphabetical order) like this:
employeename rid eid
Ben 4 2
Chloe 6 6
Dave 1 1
the displayed table data is the same as the original, indicating ORDER BY does not modify the data, is this correct?
Yes, this is correct. A SELECT statement does not change the data in a table. Only UPDATE, DELETE, INSERT or TRUNCATE statements will change the data.
However, your question shows a misconception on how a relational database works.
Rows in a table (of a relational database) are not sorted in any way. You can picture them as balls in a basket.
If you want to display data in a specific sort order, the only (really: the only) way to do that is to use an ORDER BY in your SELECT statement. There is no alternative to that.
Postgres allows you to define a VIEW that includes an ORDER BY, which might be an acceptable workaround for you:
CREATE VIEW sorted_employees AS
SELECT *
FROM employee
ORDER BY employeename ASC;
Then you can simply use
select *
from sorted_employees;
But be aware of the drawbacks. If you run select * from sorted_employees order by id then the data will be sorted twice. Postgres is not smart enough to remove the (useless) order by from the view's definition.
Some related questions:
Default row order in SELECT query - SQL Server 2008 vs SQL 2012
What is the default SQL result sort order with 'select *'?
Is PostgreSQL order fully guaranteed if sorting on a non-unique attribute?
Why do results from a SQL query not come back in the order I expect?

Does PostgreSQL have a way of creating metadata about the data in a particular table?

I'm dealing with a lot of unique data that has the same types of columns, but each group of rows has different attributes. I'm trying to see whether PostgreSQL has a way of storing metadata about groups of rows in a database, or if I would be better off adding custom columns to my current list of columns to track these different attributes. Microsoft Excel, for instance, lets you merge multiple columns into a super-column to group them as one, but I don't know how this would translate to a PostgreSQL database. Thoughts, anyone?
Right, can't upload files. Hope this turns out well.
Section 1 | Section 2 | Section 3
=================================
Num1|Num2 | Num1|Num2 | Num1|Num2
=================================
132 | 163 | 334 | 1345| 343 | 433
......
......
......
have a "super group" of columns (In SQL in general, not just postgreSQL), the easiest approach is to use multiple tables.
Example:
Person table can have columns of
person_ID, first_name, last_name
employee table can have columns of
person_id, department, manager_person_id, salary
customer table can have columns of
person_id, addr, city, state, zip
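A minimal DDL sketch of that layout (types are guesses, names taken from the column lists above):
CREATE TABLE person (
    person_id  bigint PRIMARY KEY,
    first_name text,
    last_name  text
);

CREATE TABLE employee (
    person_id         bigint PRIMARY KEY REFERENCES person (person_id),
    department        text,
    manager_person_id bigint REFERENCES person (person_id),
    salary            numeric
);

CREATE TABLE customer (
    person_id bigint PRIMARY KEY REFERENCES person (person_id),
    addr      text,
    city      text,
    state     text,
    zip       text
);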
That way, you can join them together to do whatever you like.
Example:
select *
from person p
left outer join employee e on e.person_id = p.person_id
left outer join customer c on c.person_id = p.person_id
Or any variation, while separating the data into different types and PERHAPS saving a little disk space in the process (for example, if most "people" are "customers", they don't need a bunch of employee data floating around or nullable columns).
That's how I normally handle this type of situation, but without a practical example, it's hard to say what's best in your scenario.

Optimal use of LIKE on indexed column

I have a large table (+/- 1 million rows, 7 columns including the primary key). The table contains two columns (i.e. symbol_01 and symbol_02) that are indexed and used for querying. This table contains rows such as:
id symbol_01 symbol_02 value_01 value_02
1 aaa bbb 12 15
2 bbb aaa 12 15
3 ccc ddd 20 50
4 ddd ccc 20 50
As per the example rows 1 and 2 are identical except that symbol_01 and symbol_02 are swapped but they have the same values for value_01 and value_02. That is true once again with row 3 and 4. This is the case for the entire table, there are essentially two rows for each combination of symbol_01+symbol_02.
I need to figure out a better way of handling this to get rid of the duplication. So far the solution I am considering is to just have one column called symbol which would be a combination of the two symbols, so the table would be as follows:
id symbol value_01 value_02
1 ,aaa,bbb, 12 15
2 ,ccc,ddd, 20 50
This would cut the number of rows in half. As a side note, every value in the symbol column will be unique. Results always need to be queried for using both symbols, so I would do:
select value_01, value_02
from my_table
where symbol like '%,aaa,%' and symbol like '%,bbb,%'
This would work, but my question is around performance. This is still going to be a big table (and will get bigger soon). So my question is: is this the best solution for this scenario, given that symbol will be indexed, every symbol combination will be unique, and I will need to use LIKE to query results?
Is there a better way to do this? I'm not sure how good LIKE is for performance, but I don't see an alternative.
There's no high performance solution, because your problem is shoehorning multiple values into one column.
Create a child table (with a foreign key to your current/main table) to separately hold all the individual values you want to search on, index that column and your query will be simple and fast.
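A rough sketch of that child-table design, reusing the my_table name from the question (types and names are illustrative):
CREATE TABLE pair_symbol (
    pair_id bigint NOT NULL REFERENCES my_table (id),
    symbol  text   NOT NULL,
    PRIMARY KEY (pair_id, symbol)
);
CREATE INDEX ON pair_symbol (symbol);

-- Find the pair containing both symbols, no LIKE needed:
SELECT t.value_01, t.value_02
FROM my_table t
JOIN pair_symbol s1 ON s1.pair_id = t.id AND s1.symbol = 'aaa'
JOIN pair_symbol s2 ON s2.pair_id = t.id AND s2.symbol = 'bbb';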
With this index:
create index symbol_index on t (
least(symbol_01, symbol_02),
greatest(symbol_01, symbol_02)
)
The query would be:
select *
from t
where
least(symbol_01, symbol_02) = least('aaa', 'bbb')
and
greatest(symbol_01, symbol_02) = greatest('aaa', 'bbb')
Or simply delete the duplicates:
delete from t
using (
select distinct on (
greatest(symbol_01, symbol_02),
least(symbol_01, symbol_02),
value_01, value_02
) id
from t
order by
greatest(symbol_01, symbol_02),
least(symbol_01, symbol_02),
value_01, value_02
) s
where id = s.id
This works because each combination appears exactly twice: DISTINCT ON returns a single id per (symbol pair, values) group, and deleting that row leaves its mirror-image duplicate behind. Depending on the columns' semantics, it might be better to normalize the table as suggested by @Bohemian.