SQL 3NF normalization

I have a table with 4 columns to display the bill for buying an item. Column 1 is the primary key, columns 2 and 3 are the price and the amount of the item, and column 4 is the sum of the price, which is calculated by multiplying the values in columns 2 and 3. Do I need to delete column 4 to make sure that there is no transitive functional dependency in the table?
+---------+-------+--------+------+
| bill_id | price | amount | sum  |
+---------+-------+--------+------+
|       1 |     2 |      5 |   10 |
|       2 |     3 |      5 |   15 |
+---------+-------+--------+------+

No, you don't need to. NFs are guidelines - very important and recommended, but just guidelines. While violation of 1NF is almost always a bad design decision, you may choose to violate 3NF and even 2NF provided you know what you are doing.
In this case, depending on the context, you may choose not to have a physical "sum" column and have the value calculated on the fly, although this could easily raise performance issues. But please don't even think of creating a new table that would have all possible combinations of quantity and unit price as a compound PK!
If you look at any ERP on the market you will see they store quantity, unit price, and total amount (and much more!) for every line of the quotes, orders, and invoices they handle.
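If you do decide to keep the derived total, one middle ground is a generated column: the database computes and stores it, so it can never drift out of sync with price and amount. This is only a sketch (the table and column names are made up, and it assumes PostgreSQL 12+ or another engine that supports generated columns):

-- "total" is always price * amount and cannot be written by hand
CREATE TABLE bill (
    bill_id integer PRIMARY KEY,
    price   numeric NOT NULL,
    amount  integer NOT NULL,
    total   numeric GENERATED ALWAYS AS (price * amount) STORED
);

INSERT INTO bill (bill_id, price, amount) VALUES (1, 2, 5), (2, 3, 5);

SELECT * FROM bill;  -- total comes back as 10 and 15 without ever being inserted

-- the "calculated on the fly" alternative, with no stored column at all:
-- SELECT bill_id, price, amount, price * amount AS total FROM bill;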

Getting breakdown of "Others" (the rest of Top N members) with SSAS MDX

How can I recursively get the breakdown of "Others" when Top N is applied to dimensions?
Imagine a measure Sales Amount is sliced by 3 dimensions, Region, Category and Product, and Top 1 is applied to each dimension. The result I want to see is a table like the one below. On each slice, the rest of the members are grouped as "Others".
Region | Category | Product | Sales
============================================
Europe | Bikes | Mountain Bikes | $100
| |------------------------
| | Others | $ 30
|-----------------------------------
| Others | Gloves | $ 50
| |------------------------
| | Others | $120
--------------------------------------------
Others | Clothes | Jackets | $ 80
| |------------------------
| | Others | $130
|-----------------------------------
| Others | Shoes | $ 90
| |------------------------
| | Others | $110
--------------------------------------------
When an "Others" appears, I want to see the Top 1 of the next dimension within the scope of this "Others". This seems a little tricky. e.g. tuples like (North America, Clothes) and (Central America, Clothes) need to be aggregated as (Other Regions, Clothes). Is there a neat way to aggregate the measure based on the 2nd dimension, Category?
Alternatively, I think a sub cube that filters out Europe will easily provide the breakdown of Other Regions, Clothes and Other Categories. However, this is likely to result in creating many dependent queries. For an easy processing of the result set, it would be ideal if the query returns data in the above format.
Can this be possibly achieved by a single MDX query?
To get the breakdown of "Others" you can use a dynamic set together with EXCEPT() and the AGGREGATE() function.
In each of the three dimensions you will need to create a named dynamic set that holds two members (the Top 1 member and "Others").
As an example, for the Category dimension I created a dynamic set that holds those two members like this:
CREATE MEMBER CURRENTCUBE.[Product].[French Product Category Name].[ALL].[OTHERS] AS
    AGGREGATE(
        EXCEPT(
            [Product].[French Product Category Name].[French Product Category Name].MEMBERS,
            TOPCOUNT([Product].[French Product Category Name].[French Product Category Name], 1, [Measures].[Sales Amount])
        )
    );

CREATE DYNAMIC SET [TOP1 and Others] AS
    { TOPCOUNT([Product].[French Product Category Name].[French Product Category Name], 1, [Measures].[Sales Amount]), [OTHERS] };
Because the set is dynamic, the values of the Top 1 member and "Others" will change according to the filters and slicers that you apply.

Designing a database to save and query a dynamic range?

I need to design a (postgres) database table which can save a dynamic range of something.
Example:
We have a course table. Each course can have (a minimum AND a maximum) OR (a specific amount) of participants.
A math course can be started with 4 to 10 students while a physics course needs to have exactly 8 students to start.
After that, I want to be able to query that.
Let's say, I want all courses which can take 6 students. The math course should be returned, the physics course shouldn't as it requires exactly 8 students.
When I query for 8 students, both courses should be returned.
For the implementation I thought about two simple fields: min_students and max_students. Then I could simply check if the number is equal to or between these numbers.
The issue is that I have to fill both columns every time, even for the physics course which requires exactly 8 students.
Example:
name    | min_students | max_students
--------|--------------|-------------
math    |            4 |           10
physics |            8 |            8
Is there a more elegant/efficient way? I also thought about making the max_students column nullable so I could check for
min_students = X OR (min_students >= X AND max_students <= Y)
Would that be more efficient? What about the performance?
Each course can have (a minimum AND a maximum) OR (a specific amount) of participants.
All courses have a minimum and a maximum; for some courses they happen to be the same value. It might seem trivial, but thinking about it that way lets you define the problem in a simpler way.
Instead of:
min_students = X OR (min_students >= X AND max_students <= Y)
you can express it as:
num_students BETWEEN min_students AND max_students
BETWEEN is inclusive, so 8 BETWEEN 8 AND 8 is true.
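As a concrete sketch (reusing the min_students/max_students design from the question; the table name is made up), the query "all courses which can take 6 students" becomes:

create table courses (name text, min_students int, max_students int);

insert into courses values
('math', 4, 10), ('physics', 8, 8);

-- courses that can run with exactly 6 participants: only math
select name from courses
where 6 between min_students and max_students;

-- with 8 participants both rows come back
select name from courses
where 8 between min_students and max_students;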
Regarding optimizations
Additional conditionals make queries exponentially harder for humans to understand, which leads to missed edge cases and usually results in inefficient queries anyway. Focus on making the code easy to understand, or "elegant", and never sacrifice readability for performance unless you are really sure that you have a performance issue in the first place and that your optimization actually helps.
If you have a table with 10M rows it might be worth looking into aggressively optimizing disk usage if you run on extremely limited hardware, but shrinking a table by even 20 MB is almost certainly a waste of time in any normal circumstance, even when it doesn't make the code more complicated.
Besides, each row takes up 23-24 bytes in addition to any actual data it contains, so shaving off a byte or two wouldn't make a big difference. Setting values to NULL can actually increase disk usage in some situations.
Alternative solution
When using a range data type the comparison would look like this:
num_students @> x
where num_students represents a range (for example 4 to 10) and @> means "contains the value".
create table num_sequence (num int);
create table courses_range (name text, num_students int4range);
insert into num_sequence select generate_series(3,10);
insert into courses_range values
('math', '[4,4]'), ('physics', '[6,7]'), ('dance', '[7,9]');
select * from num_sequence
left join courses_range on num_students @> num;
 num |  name   | num_students
-----+---------+--------------
   3 |         |
   4 | math    | [4,5)
   5 |         |
   6 | physics | [6,8)
   7 | physics | [6,8)
   7 | dance   | [7,10)
   8 | dance   | [7,10)
   9 | dance   | [7,10)
  10 |         |
Note that the ranges are output in canonical form like [x,y): a square bracket means inclusive, a parenthesis means exclusive, and for integers [4,4] = [4,5) = (3,5).
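To answer the original question with this schema (a sketch reusing the courses_range table above), a single participant count is matched with the containment operator directly:

-- all courses that can take exactly 6 students;
-- with the sample data above only physics ([6,8)) matches
select name
from courses_range
where num_students @> 6;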

PostgreSQL Fuzzy Searching multiple words with Levenshtein

I am putting together a PostgreSQL query to allow for fuzzy searching when looking up a company's name in an app that I am working on. I have found and have been working with Postgres' levenshtein method (part of the fuzzystrmatch module), and for the most part it is working. However, it only seems to work when the company's name is one word, for example:
With Apple (which is stored in the database as simply apple) I can run the following query and have it work near perfectly (it returns a levenshtein distance of 0):
SELECT * FROM contents
WHERE levenshtein(company_name, 'apple') < 4;
However when I take the same approach with Sony (which is stored in the database as Sony Electronics INC) I am unable to get any useful results (entering Sony gives a levenshtein distance of 16).
I have tried to remedy this problem by breaking the company's name down into individual words and inputting each one individually, resulting in something like this:
user input => 'sony'
SELECT * FROM contents
WHERE levenshtein('Sony', 'sony') < 4
OR levenshtein('Electronics', 'sony') < 4
OR levenshtein('INC', 'sony') < 4;
So my question is this: is there some way that I can accurately implement a multi-word fuzzy search with the current general approach that I have now, or am I looking in the complete wrong place?
Thanks!
Given your data and the following query with wild values for the Levenshtein Insertion (10000), Deletion (100) and Substitution (1) cost:
with sample_data as (
    select 101 "id", 'Sony Entertainment Inc' as "name"
    union
    select 102 "id", 'Apple Corp' as "name"
)
select sample_data.id,
       sample_data.name,
       components.part,
       levenshtein(components.part, 'sony', 10000, 100, 1) ld_sony
from sample_data
inner join (select sd.id,
                   lower(unnest(regexp_split_to_array(sd.name, E'\\s+'))) part
            from sample_data sd) components on components.id = sample_data.id;
The output is:
 id  |          name          |     part      | ld_sony
-----+------------------------+---------------+---------
 101 | Sony Entertainment Inc | sony          |       0
 101 | Sony Entertainment Inc | entertainment |     903
 101 | Sony Entertainment Inc | inc           |   10002
 102 | Apple Corp             | apple         |     104
 102 | Apple Corp             | corp          |       3
(5 rows)
Row 1 - no changes
Row 2 - 9 deletions and 3 changes
Row 3 - 1 insertion and 2 changes
Row 4 - 1 deletion and 4 changes
Row 5 - 3 changes
I've found that splitting the words out causes a lot of false positives when you give a threshold. You can order by the Levenshtein distance to position the better matches close to the top. Maybe tweaking the Levenshtein variables will help you order the matches better. Sadly, Levenshtein doesn't weight earlier changes differently than later changes.
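A rough sketch of that ordering idea (assuming the contents table from the question has an id primary key alongside company_name, and that the fuzzystrmatch extension is installed):

-- best (smallest) per-word distance for each company, best matches first
select c.id,
       c.company_name,
       min(levenshtein(lower(w.word), 'sony')) as best_distance
from contents c
cross join lateral regexp_split_to_table(c.company_name, E'\\s+') as w(word)
group by c.id, c.company_name
order by best_distance;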

Will SELECT always start with the first (oldest) record in a table?

I made a table like
record
----------+
1 | one |
----------+
2 | two |
----------+
3 | three |
----------+
4 | four |
----------+
5 | five |
----------+
There isn't an ID column; those are just the row numbers I see beside each row in DBVisualizer. I added the rows in the order 1, 2, 3, 4, 5. Is this:
SELECT *
FROM sch.test_table
LIMIT 1;
certain always to get "one", i.e. start with the "oldest" record? Or will that change in large datasets?
No, as per the SQL specification the order is indeterminate when not using order by. You're working with a set of data, and sets are not ordered. Also, the size of the set should not matter.
The PostgreSQL documentation says:
If ORDER BY is not given, the rows are returned in whatever order the
system finds fastest to produce.
Which means that the rows might come back in the expected order, or they might not - there are no guarantees.
The bottom line is that if you want deterministic results you have to use ORDER BY.
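A minimal sketch of making that deterministic (a hypothetical table, since the one in the question has no column that records insertion order; adding a serial id or a created_at timestamp gives you something to sort on):

CREATE TABLE sch.test_table2 (
    id     serial PRIMARY KEY,  -- records insertion order from now on
    record text
);

INSERT INTO sch.test_table2 (record)
VALUES ('one'), ('two'), ('three'), ('four'), ('five');

-- deterministic: always returns the row inserted first
SELECT *
FROM sch.test_table2
ORDER BY id
LIMIT 1;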

Cross tab summary fields don't restrict by column

I'm working with Crystal Reports 10 and was looking at the Cross-Tab to try to get a nice, neat table of my information. I'm trying to make a report where, for each item (as a row), the columns are the different sizes it comes in and the value of each cell is the quantity.
So something that looks like this:
        Small | Medium | Large
Item 1      1 |      5 |    10
Item 2      5 |     10 |    15
Using the cross tab though, the quantity field I have has to be totalled, averaged, etc. so I can't get the specific breakdown for each size in a nice table like that. Is there any way to tweak the Cross Tab to do this or is there another tool in Crystal Reports that would let me have the quantities per size in that organized fashion?
Thanks for any help you guys can give.
Update:
The cross tab I have tried gives me something that looks like this
        Small | Medium | Large
Item 1     16 |     16 |    16
Item 2     30 |     30 |    30
If I put the values in the details section as separate fields, I'm able to get the values to match up properly, but it's not the right format. It comes out like this:
Item 1 | Small | 1
Item 1 | Medium| 5
Item 1 | Large | 10
Create a Cross-tab
Add {table.size} to the Columns
Add {table.item} to the Rows
Add {table.quantity} to Summarized Fields