Designing a database to save and query a dynamic range? - postgresql

I need to design a (postgres) database table which can save a dynamic range of something.
Example:
We have a course table. Each course can have (a minimum AND a maximum) OR (a specific amount) of participants.
A math course can be started with 4 to 10 students while a physics course needs to have exactly 8 students to start.
After that, I want to be able to query that.
Let's say, I want all courses which can take 6 students. The math course should be returned, the physics course shouldn't as it requires exactly 8 students.
When I query for 8 students, both courses should be returned.
For the implementation I thought about two simple fields: min_students and max_students. Then I could simply check if the number is equal to or between these numbers.
The issue is: I have to fill both columns every time, also for the physics course, which requires exactly 8 students.
example:
name    | min_students | max_students
--------|--------------|-------------
math    | 4            | 10
physics | 8            | 8
Is there a more elegant/efficient way? I also thought about making the max_students column nullable so I could check for
min_students = X OR (min_students >= X AND max_students <= Y)
Would that be more efficient? What about the performance?

Each course can have (a minimum AND a maximum) OR (a specific amount) of participants.
All courses have a minimum and a maximum; for some courses they happen to be the same value. It might seem trivial, but thinking about it that way lets you define the problem in a simpler way.
Instead of:
min_students = X OR (min_students >= X AND max_students <= Y)
you can express it as:
num_students BETWEEN min_students AND max_students
BETWEEN is inclusive, so 8 BETWEEN 8 AND 8 is true.
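As a minimal sketch of that two-column version (the table name courses is made up here; the columns follow the question, and the literal 6 stands for the requested class size):
create table courses (name text, min_students int, max_students int);
insert into courses values ('math', 4, 10), ('physics', 8, 8);
-- all courses that can run with exactly 6 participants: returns only math;
-- with 8 instead of 6, both courses are returned
select name from courses where 6 between min_students and max_students;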
Regarding optimizations
Additional conditionals make queries considerably harder for humans to understand, which leads to missed edge cases and usually results in inefficient queries anyway. Focus on making the code easy to understand, or "elegant", and never sacrifice readability for performance unless you are really sure that you have a performance problem in the first place and that your optimization actually helps.
If you have a table with 10M rows it might be worth aggressively optimizing disk usage if you run on extremely limited hardware, but reducing the disk usage of a table by even 20 MB is almost certainly a waste of time in any normal circumstance, even when it doesn't make the code more complicated.
Besides, each row takes up 23-24 bytes of overhead in addition to any actual data it contains, so shaving off a byte or two wouldn't make a big difference. Setting values to NULL can actually increase disk usage in some situations.
Alternative solution
When using a range data type the comparison would look like this:
num_students @> x
where num_students represents a range (for example 4 to 10) and @> means "contains the value".
create table num_sequence (num int);
create table courses_range (name text, num_students int4range);
insert into num_sequence select generate_series(3,10);
insert into courses_range values
('math', '[4,4]'), ('physics', '[6,7]'), ('dance', '[7,9]');
select * from num_sequence
left join courses_range on num_students @> num;
 num |  name   | num_students
-----+---------+--------------
   3 |         |
   4 | math    | [4,5)
   5 |         |
   6 | physics | [6,8)
   7 | physics | [6,8)
   7 | dance   | [7,10)
   8 | dance   | [7,10)
   9 | dance   | [7,10)
  10 |         |
Note that the ranges are output in the format [x,y): a square bracket means inclusive, a parenthesis means exclusive, and for integers [4,4] = [4,5) = (3,5).
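If you want to verify that equivalence yourself, a quick check using only the built-in range constructors is:
select int4range(4,4,'[]') = int4range(4,5)      as closed_eq_halfopen,  -- [4,4] = [4,5)
       int4range(4,4,'[]') = int4range(3,5,'()') as closed_eq_open;      -- [4,4] = (3,5)
-- both comparisons return true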

Related

SQL 3 NF normalization

I have a table with 4 columns to display the bill of buying an item. Column 1 is the primary key, columns 2 and 3 are the price and the amount of the item, and column 4 is the sum, which is calculated by multiplying the values in columns 2 and 3. Do I need to delete column 4 to make sure that there is no transitive functional dependency in the table?
+---------+-------+--------+------+
| bill_id | price | amount | sum  |
+---------+-------+--------+------+
|       1 |     2 |      5 |   10 |
|       2 |     3 |      5 |   15 |
+---------+-------+--------+------+
No, you don't need to. NFs are guidelines - very important and recommended, but just guidelines. While violation of 1NF is almost always a bad design decision, you may choose to violate 3NF and even 2NF provided you know what you are doing.
In this case, depending on the context, you may choose not to have a physical "sum" column and have the value calculated on the fly, although this could easily raise performance issues. But please don't even think of creating a new table which would have all possible combinations of quantity and unit price as a compound PK!
If you look at any ERP on the market you will see they store quantity, unit price, and total amount (and much more!) for every line of the quotes, orders, and invoices they handle.
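If you want to keep the convenience of a stored total without risking it going out of sync, one possible middle ground (a sketch, assuming PostgreSQL 12 or later and the column names from the question) is a generated column:
create table bill (
    bill_id int primary key,
    price   int not null,
    amount  int not null,
    sum     int generated always as (price * amount) stored  -- derived by the database, cannot be set directly
);
The value is computed by the database on every insert and update, so the derived column can never disagree with price and amount.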

Tableau - Return Name of Column with Max Value

I am new to Tableau visualization and need some help.
I have a set of shipping lanes which have whole number values based on the duration of the shipment.
Ex:
| Lane Name | 0 Day | 1 Day | 2 Day | 3 Day | 4 Day |
| SFO-LAX   |     0 |    30 |    60 |    10 |     0 |
| JFK-LAX   |     0 |    10 |    20 |    50 |    80 |
For each Lane Name, I want to return the column header based on the max value.
i.e. for SFO-LAX I would return '2 Day', for JFK-LAX I would return '4 Day', etc.
I then want to set this as a filter to only show 2 Day and 3 Day results in my Tableau data set.
Can someone help?
Two steps to this.
The first step is pretty easy: pivot your data. Read the Tableau help to learn how to PIVOT your data for analysis - i.e. make it look to Tableau like a longer, 3-column data set with Lane, Duration, Value as the 3 columns. Tableau's PIVOT feature will let you view your data in that format (which makes analysis much easier) without actually changing the format of your data files.
The second step is a bit trickier than you'd expect at first glance, and there are a few ways to accomplish it. The Tableau features that can be used for this are LOD calcs, table calcs, filters and possibly sets. These are some of the more powerful but complicated parts of Tableau, so worth your time to learn about, but expect to take a while to spin up on them.
The easiest solution is probably to use one of the RANK() functions - start with a quick table calc. Set your partitioning and addressing as desired so that the ranks are computed for the blocks of data that you desire - say, partitioning on Lane and addressing (computing) by Duration. Then, when you are happy with the ranks you see, move the rank calculation to the filter shelf and only display data where rank = 1.
This is a quick solution once you get the hang of it, but it can get slow for very large data sets, since the rank calculations are done on the client side, requiring fetching all the data that you end up not displaying. If performance becomes an issue, you might want to look at other solutions that do more of the calculations server side - possibly using LOD calcs or analytic (aka windowing) queries invoked from custom SQL.
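As a rough sketch of that custom-SQL variant (the table and column names lane_durations, lane, duration and value are made up; they stand for the pivoted data):
-- keep only the duration bucket(s) with the highest value per lane, server side
select lane, duration
from (select lane, duration,
             rank() over (partition by lane order by value desc) as rnk
      from lane_durations) ranked
where rnk = 1;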

Best way to filter int columns in Postgres

Imagine I have a table like this:
 id | value
----|------
  1 | 1200
  2 | 3450
  3 | 1230
  4 | 1245
  5 | 4512
Both id and value are integers. Now I want to filter all rows whose value starts with 12; in this case I want these ids: 1, 3, 4.
I can think of a few different ways to do this:
Cast the value field to a string and then use LIKE or a regex to filter
Use integer division (div) to divide those values by 100 and then compare
Use a shift-right to shift those values by 2 and then compare
In the first case I'm not really sure about the performance, because working with strings usually takes the most time.
In the third case I don't know how to do it, or whether it is even possible to do something like that.
Generally, I want to know what the best practice is and how to do it. Is there a better way that I couldn't find?
In addition, I'm using Postgres and working with the SQLAlchemy ORM, so either a SQL query or a SQLAlchemy query will be accepted.
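For reference, a rough sketch of the first two approaches in plain SQL (the table name my_table is made up; SQLAlchemy can emit either form through a filter expression):
-- 1) cast to text and pattern-match
select id from my_table where value::text like '12%';
-- 2) integer division: only equivalent if all values have the same number of digits
--    (here 4 digits, so dividing by 100 keeps the leading two)
select id from my_table where value / 100 = 12;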

postgres ANY() with BETWEEN Condition

In case someone is wondering, I am recycling a different question I answered myself, because I realized that my problem has a different root cause than I thought:
My question actually seems pretty simple, but I cannot find a way.
How do I query Postgres for whether any element of an array is between two values?
The documentation states that a BETWEEN b AND c is equivalent to a >= b AND a <= c.
This however does not work on arrays, as
ANY({1, 101}) BETWEEN 10 and 20 has to be false
while
ANY({1,101}) > 10 AND ANY({1,101}) < 20 has to be true.
{1,101} meaning an array containing the two elements 1 and 101.
How can I solve this problem without resorting to workarounds?
regards,
BillDoor
EDIT: for clarity:
The scenario I have is: I am querying an XML document via xpath(), but for this problem a column containing an array of type int[] does the job.
id::int | numbers::int[] | name::text
1       | {1,3,200}      | Alice
2       | {21,100}       | Bob
I want all names where there is a number between 20 and 30 - so I want Bob.
The query
SELECT name from table where ANY(numbers) > 20 AND ANY(numbers) < 30
will return Alice and Bob, because Alice has numbers > 20 as well as other numbers < 30.
The BETWEEN syntax is not allowed in this case; however, BETWEEN only gets mapped to >= 20 AND <= 30 internally anyway.
Quoting the docs on the Between Operators' mapping to > and < documentation:
There is no difference between the two respective forms apart from the
CPU cycles required to rewrite the first one into the second one
internally.
PS.:
Just to avoid adding a new question for this: how can I solve
id::int | numbers::int[] | name::text
1       | {1,3,200}      | Alice
2       | {21,100}       | Alicia
SELECT id FROM table WHERE ANY(name) LIKE 'Alic%'
result: 1, 2
I can only find examples of matching one value against multiple patterns, but not matching one pattern against a set of values. Besides, the shown syntax is invalid: ANY has to be the second operand, but the second operand of LIKE has to be the pattern.
select exists (select * from (select unnest(array[1,101]) x) q1 where x between 10 and 20);
You can create a function based on this query.
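For example, a possible sketch (the function name any_between and the table name my_table are made up here):
create function any_between(arr int[], lo int, hi int) returns boolean
language sql immutable as $$
    select exists (select 1 from unnest(arr) x where x between lo and hi)
$$;
-- returns Bob only: 21 lies between 20 and 30
select name from my_table where any_between(numbers, 20, 30);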
second approach:
select int4range(10,20,'[]') @> any(array[1, 101]);
For timestamps and dates it's like:
select tsrange('2015-01-01'::timestamp, '2015-05-01'::timestamp, '[]') @> any(array['2015-05-01', '2015-05-02']::timestamp[]);
For more info read: range operators.

PostgreSQL Fuzzy Searching multiple words with Levenshtein

I am working out a PostgreSQL query to allow for fuzzy searching capabilities when searching for a company's name in an app that I am working on. I have found and have been working with Postgres' Levenshtein method (part of the fuzzystrmatch module) and for the most part it is working. However, it only seems to work when the company's name is one word, for example:
With Apple (which is stored in the database as simply apple) I can run the following query and have it work near perfectly (it returns a levenshtein distance of 0):
SELECT * FROM contents
WHERE levenshtein(company_name, 'apple') < 4;
However when I take the same approach with Sony (which is stored in the database as Sony Electronics INC) I am unable to get any useful results (entering Sony gives a levenshtein distance of 16).
I have tried to remedy this problem by breaking the company's name down into individual words and inputting each one individually, resulting in something like this:
user input => 'sony'
SELECT * FROM contents
WHERE levenshtein('Sony', 'sony') < 4
OR levenshtein('Electronics', 'sony') < 4
OR levenshtein('INC', 'sony') < 4;
So my question is this: is there some way that I can accurately implement a multi-word fuzzy search with the current general approach that I have now, or am I looking in the complete wrong place?
Thanks!
Given your data and the following query with wild values for the Levenshtein Insertion (10000), Deletion (100) and Substitution (1) cost:
with sample_data as (select 101 "id", 'Sony Entertainment Inc' as "name"
                     union
                     select 102 "id", 'Apple Corp' as "name")
select sample_data.id, sample_data.name, components.part,
       levenshtein(components.part, 'sony', 10000, 100, 1) ld_sony
from sample_data
inner join (select sd.id,
                   lower(unnest(regexp_split_to_array(sd.name, E'\\s+'))) part
            from sample_data sd) components on components.id = sample_data.id;
The output is:
 id  |          name          |     part      | ld_sony
-----+------------------------+---------------+---------
 101 | Sony Entertainment Inc | sony          |       0
 101 | Sony Entertainment Inc | entertainment |     903
 101 | Sony Entertainment Inc | inc           |   10002
 102 | Apple Corp             | apple         |     104
 102 | Apple Corp             | corp          |       3
(5 rows)
Row 1 - no changes.
Row 2 - 9 deletions and 3 changes
Row 3 - 1 insertion and 2 changes
Row 4 - 1 deletion and 4 changes
Row 5 - 3 changes
I've found that splitting the words out causes a lot of false positives when you give a threshold. You can order by the Levenshtein distance to position the better matches close to the top. Maybe tweaking the Levenshtein cost parameters will help you order the matches better. Sadly, Levenshtein doesn't weight earlier changes differently than later changes.
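A rough sketch of that ordering idea against the contents table from the question (the word splitting mirrors the example query above; everything beyond company_name is assumed):
select c.company_name,
       min(levenshtein(w.part, 'sony')) as best_distance  -- best per-word match for this company
from contents c
cross join lateral
     unnest(regexp_split_to_array(lower(c.company_name), E'\\s+')) as w(part)
group by c.company_name
order by best_distance;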