Understanding database schema - normalized forms

I have this current setup:
product
product_id | product_name | category_id
category
category_id | category_name
vendor
vendor_id | vendor_name | vendor_status
vendor_price
vendor_id | product_id | vendor_price
As I understand it, according to the "rules" of normalization there should be 2
more tables declaring the relationship like this:
rel_product_vendor_price
product_id | vendor_price_id
rel_vendor_price_vendor
vendor_price_id | vendor_id
The vendor_price table above would then have product_id removed and a vendor_price_id added.
I fail to see the point in creating two more tables just to tie things together, as it will complicate queries. The INSERTs in particular become complicated and must be performed in transactions.
Currently the tables hold more than 300,000 products, each with several different vendors offering different prices, which adds up to more than 1.5 million documents in Sphinx.
Am I wrong in my design, or would there be any advantage in changing it to a more normalized design?
UPDATE
I have one more table to hold all the product categories. I have updated the schema above; I forgot it in the initial post.
Generally I split the queries by category, and for each category I query all the products belonging to it. When a user clicks a product I query all the prices for that particular product and display them in descending order.
Because a vendor can be suspended (vendor.vendor_status), all queries must be performed with several joins leading back to the vendor table.
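For example, the per-product price lookup is roughly this (the 'active' status value is simplified here):

-- Approximate shape of the per-product price query; the vendor join is
-- needed so suspended vendors are filtered out.
SELECT v.vendor_name, vp.vendor_price
FROM vendor_price vp
JOIN vendor v ON v.vendor_id = vp.vendor_id
WHERE vp.product_id = ?          -- the product the user clicked
  AND v.vendor_status = 'active' -- simplified status value
ORDER BY vp.vendor_price DESC;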
For inserts I delete everything in product from a particular vendor; all vendor prices from that vendor get deleted as well due to the foreign key constraint. Then I insert the new rows into product and vendor_price.
Hope this makes sense.
UPDATE 2
Having run a lot of query tests tonight, I have discovered that keeping vendor_status only in the vendor table REALLY slows things down a LOT.
The database has to join vendor_price against vendor every time it selects a price, which matters a great deal for aggregates such as:
MIN(vendor_price) AS min_vendor_price, MAX(vendor_price) AS max_vendor_price
Keeping a duplicate of vendor_status in each vendor_price row means a LOT of redundant data, but it really speeds up the selects.
From: Query took 7.8040 sec
To: Query took 3.1640 sec
When data sets get this large, I guess it's a matter of balancing query optimization against heavy use of caching. Normalization really gets in the way of speed, even on today's hardware.
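For the record, the aggregate query before and after the denormalization looks roughly like this (table and column names as above, status value simplified):

-- Before: the status check requires joining back to vendor for every price row.
SELECT MIN(vp.vendor_price) AS min_vendor_price,
       MAX(vp.vendor_price) AS max_vendor_price
FROM vendor_price vp
JOIN vendor v ON v.vendor_id = vp.vendor_id
WHERE vp.product_id = ?
  AND v.vendor_status = 'active';  -- simplified status value

-- After: vendor_status duplicated onto vendor_price, no join needed.
SELECT MIN(vendor_price) AS min_vendor_price,
       MAX(vendor_price) AS max_vendor_price
FROM vendor_price
WHERE product_id = ?
  AND vendor_status = 'active';    -- redundant copy, kept in sync on writes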

Normalization attempts to eliminate redundant data so that inserts/updates/deletes don't have to touch more than one table at a time. Conversely, redundant data can speed up queries by eliminating the need for lots of joins, but then you have to deal with inserting/updating/deleting in multiple places. Your three-table schema looks fine to me, assuming you just want to look up prices based on vendor ids and product ids, but please give more background on the types of queries you hope to run and what other kinds of data you're planning on storing.
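For instance, a basic price lookup against your current three tables is just a couple of straightforward joins (column names taken from your post; the exact lookup keys are an assumption about your access pattern):

-- Price for one product from one vendor, with no extra relationship tables.
SELECT p.product_name, v.vendor_name, vp.vendor_price
FROM vendor_price vp
JOIN product p ON p.product_id = vp.product_id
JOIN vendor  v ON v.vendor_id  = vp.vendor_id
WHERE vp.product_id = ?
  AND vp.vendor_id  = ?;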

Related

Can I have a list of foreign keys as a single field? [duplicate]

Newbie trying to figure out the best way to design a Postgres db for the following use case scenario.
There is an Account table for the business customers and there is a Contacts table with a foreign-key column relating it back to Accounts:
account.pk_id, ….
contacts.pk_id, contacts.fk_accountid …
Thousands of different businesses in the Accounts table will be storing millions of contacts each in the Contacts table.
Each contact record will over time belong to between 1 and 100 different categories, lists and products.
If I use a classic SQL master/child relationship I potentially end up with millions and millions of rows in tables such as contacts_categories, contacts_lists and contacts_products, which would reference the Categories, Lists & Products tables.
Alternatively, I could store the related keys (UUIDs) for categories, lists and products in three character varying[] array columns on the contact record row. This would eliminate the need for the contacts_categories, contacts_lists and contacts_products tables, which would otherwise be quite large.
With tools like SELECT unnest(...), array_append() and the array index options it seems like a smart solution, but I am curious to know whether it is better to stick to normalized relations, with more tables and higher row counts, for performance and/or storage cost.
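Roughly what I have in mind (column names and the uuid[] type are placeholders; Postgres syntax):

-- Hypothetical sketch of the array-column approach.
ALTER TABLE contacts ADD COLUMN category_ids uuid[] NOT NULL DEFAULT '{}';

-- Append a category key to one contact ($1 = the category uuid).
UPDATE contacts
SET category_ids = array_append(category_ids, $1)
WHERE pk_id = 42;

-- Expand the array back into rows.
SELECT pk_id, unnest(category_ids) AS category_id
FROM contacts;

-- A GIN index would be needed for fast containment (@>) lookups.
CREATE INDEX ON contacts USING gin (category_ids);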
Has anybody tried this before?
Too many people have tried that, and it is a bad idea. Many of your queries, particularly joins, will become complicated and slow. Besides, you won't be able to have foreign key constraints to guarantee data integrity.
Relational databases are good at coping with millions of rows in a table. Keep your schema normalized.
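A minimal sketch of the normalized alternative (plain junction tables with foreign keys; key names and types are assumed from your description):

-- The foreign keys give you the integrity guarantees that array columns cannot.
CREATE TABLE contacts_categories (
    contact_id  bigint NOT NULL REFERENCES contacts (pk_id)   ON DELETE CASCADE,
    category_id uuid   NOT NULL REFERENCES categories (pk_id) ON DELETE CASCADE,
    PRIMARY KEY (contact_id, category_id)
);
-- contacts_lists and contacts_products follow the same pattern.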

How to avoid customer's order history being changed in MongoDB?

I have two collections
Customers
Products
I have a field called "orders" in each of my customer documents, and this "orders" field stores references to the product IDs ordered by that customer. My question: since I'm referencing the product ID, if I update the "title" of that product it will also change in the customer's order history. I can't embed the full order information, because a customer may order thousands of products and the document could hit the 16 MB limit in no time. So what's the fix for this? Thanks.
Create an Orders Collection
Store ID of the user who made the order
Store ID of the product bought
I understand you are looking up the product's value from the customer entity. You will always get the latest price if you are not storing the historical order/price transactions, because your data model is designed this way, to retrieve the latest price information.
My suggestion:
Orders placed, with product and price, should always be stored in a history entity (order lines) that no process is allowed to change. That way, when you look up the products a customer bought, you always get the historical price, and a later price change on the product does not affect the previous order. Two options:
Store the order history in the current customers collection (or only the most recent, say, 50 order lines if you don't need the full history; write additional logic to handle this).
If option 1 is not feasible due to the large number of orders, consider creating an order-lines transaction collection and referencing the order line for the product bought via a DBRef or a $lookup.
Note: it would have helped if you had given the current number of documents in each collection and the expected quarter-over-quarter growth rate.
You have orders and products. Orders are referencing products. Your problem is that the products get updated and now your orders reference the new product. The easiest way to combat this issue is to store full data in each order. Store all the key product-related information.
The advantage is that this kind of solution is extremely easy to visualize and implement. The disadvantage is that you have a lot of repetitive data since most of your products probably don't get updated.
If you store a product update history based on timestamps, then you could solve your problem. Products would then be identified by three fields: the product ID, an active start date and an active end date. Or you could configure products this way: product ID = product ID + "Version X", and store this version against each order.
If you use dates, then you query for the product and find the version that was active during the period when the order occurred. If you use versions against the product, then you simply query the database for that particular version of the product. I haven't used MongoDB, so I'm not sure exactly how you would achieve this there; naively, however, you could modify the product ID to include the version, possibly using # as a delimiter.
The advantage of this solution is that you don't store too much extra data. Considering that products won't be updated very often, I feel like this is the ideal solution to your problem.

Feedback about my database design (multi tenancy)

The idea of the SaaS tool is to have dynamic tables with dynamic custom fields and values of different types. We were considering the "force.com/salesforce.com" approach, but it seems too complicated to maintain going forward and it would put a huge abstraction layer under any reporting, so we came up with a simpler idea; we just want to be sure it is a reasonably good approach.
This is the architecture we have today (in a few steps):
Each tenant has its own separate database on the cluster (Postgres 12).
A TABLE table keeps all of those tables as references; this entity has a ManyToOne relation to the META table and a OneToMany relation with the DATA table.
The META table is used for metadata configuration and has a OneToMany relation with FIELDS (which holds the name of each field, its type, e.g. TEXT/INTEGER/BOOLEAN/DATETIME, and the attribute column it maps to, stored only as a string reference).
The DATA table has a ManyToOne relation to TABLE and 50 character varying columns named attribute1...attribute50, all NULL-able.
Example flow today:
When a user wants to open a TABLE's DATA, e.g. "CARS", we load the META table with all the FIELDS (to get the fields for this query). Say the user wants to query against the Brand, Class, Year and Price columns.
Our logic then looks up the references for Brand, Class, Year and Price in the META > FIELDS table, so we know that Brand = attribute2, Class = attribute5, Year = attribute6 and Price = attribute7.
We parse the request into a query, e.g. SELECT attribute2, attribute5, attribute6, attribute7 FROM DATA, and show the results to the user. If the user then applies filters on this data, e.g. Year > 2017 AND Class = 'A', we use SQL's CAST() functionality, for example SELECT CAST(attribute6 AS int), attribute5 FROM DATA WHERE CAST(attribute6 AS int) > 2017 AND attribute5 = 'A';. This way we can support most SQL features.
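Spelled out as SQL, the flow is roughly as follows (the META/FIELDS column names here are approximations of our actual schema):

-- 1. Resolve which attribute columns the requested fields map to.
SELECT f.field_name, f.attribute_column
FROM META m
JOIN FIELDS f ON f.meta_id = m.id
WHERE m.table_id = :cars_table_id
  AND f.field_name IN ('Brand', 'Class', 'Year', 'Price');

-- 2. Run the resolved query against DATA.
SELECT attribute2              AS brand,
       attribute5              AS class,
       CAST(attribute6 AS int) AS year,
       attribute7              AS price
FROM DATA
WHERE tableid = :cars_table_id
  AND CAST(attribute6 AS int) > 2017
  AND attribute5 = 'A';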
However, moving forward we are a bit worried about:
Managing such an environment for more tenants as the number of tables grows (e.g. 50 per customer, with roughly 1-5 million rows per TABLE; 5 million is the maximum we allow, anything bigger goes to BigQuery), which gives us 50-250 million rows in a single DATA_X table. This might affect query performance, especially since we expose simple WHERE conditions (less than, equal, null, etc.) through an abstraction language, e.g. GET CARS [BRAND,CLASS,PRICE...] FILTER [EQ(CLASS,A),MT(YEAR,2017)], designed to be similar to JQL (Jira Query Language).
Transaction locking: we allow batch CSV uploads into DATA_X, so when a tenant loads e.g. 1 GB of data, the table is effectively locked for other systems accessing the DATA table.
Keeping many NULL columns, which can affect storage a bit (for now we are not too worried: at TABLE creation the customer decides how many columns they want, and based on that we assign the TABLE to one of the hardcoded entities DATA_5, DATA_10, DATA_15, DATA_20, DATA_30 or DATA_50, where the number corresponds to the limit on attribute columns; we also support a migration option if they decide to switch from 5 to 10 attributes, etc.).
We are at a very early stage, so we can and should make these changes before we scale. We knew this was most likely not the best approach, but we kept it to get the project running for small customers, and for now it is working just fine.
We also thought about JSONB columns, but that is not an option, as we want to keep data retrieval simple.
What do you think about this solution? (FYI: DATA has a composite PRIMARY KEY (ID, TABLEID) and a built-in CreatedAt column used by most queries, so there will be at most three indexes.)
If it seems bad, what would you recommend as an alternative, based on the details I shared (basically a schema-less RDBMS)?
IMHO, I anticipate issues when you want to join tables, and also with all the casting, etc.
We followed the approach below, which may be of help to you.
We have a table called Cars, plus a couple of supporting tables such as CarsMeta and CarsExtension. The underlying Cars table has all the fields common to all tenants. The CarsMeta table describes which types of columns are available for extending the Cars entity. The CarsExtension table has columns like StringCol1...5, IntCol1...5, LongCol1...10.
This way you can also filter the data easily:
If the filter is on the base table, perform the search there; if results are found, match the ids against the CarsExtension table to get the list of extended rows for the entity.
If the filter is on the extended fields, search the extension table and match the results back to the base entity ids.
The extension table is organized like below:
id - UniqueId
entityid - uniqueid (points to the primary key of the entity)
StringCol1 - string,
...
IntCol1 - int,
...
In this case it is easy to join the entity with its extension rows and get the data along with the extension fields.
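A rough DDL sketch of this layout (types and the common Cars columns are illustrative only):

-- Base entity with the fields common to all tenants.
CREATE TABLE Cars (
    id    uuid PRIMARY KEY,
    brand text,
    model text
);

-- Tenant-specific extension values, one row per extended Cars row.
-- (CarsMeta, not shown, records which extension column each custom field uses.)
CREATE TABLE CarsExtension (
    id         uuid PRIMARY KEY,
    entityid   uuid NOT NULL REFERENCES Cars (id),
    StringCol1 text,
    StringCol2 text,
    IntCol1    integer,
    IntCol2    integer
);

-- Fetch an entity together with its extension fields.
SELECT c.*, e.StringCol1, e.IntCol1
FROM Cars c
JOIN CarsExtension e ON e.entityid = c.id
WHERE c.brand = 'Toyota';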
If the table metadata and the data itself have to be inferred from separate tables, it will be difficult to maintain over a long period of time and with a huge volume of data.
HTH

Is there a way to include a column from one table in many other tables (while maintaining consistency) in PostgreSQL?

I'm trying to build a database (in PostgreSQL 9.6.6) that allows one "master column" (items.id) to be replicated into many (automatically generated) tables (e.g. rank1.id, rank2.id, rank3.id, ...). Only items will have INSERTs (or DELETEs) performed on it, and when they happen the newly added ids should also show up in (or be removed from) the rankX table(s). To be more concrete:
items:
id | name | description
rank1:
id | rank
rank2:
id | rank
...
Here the ids are always the same, and there is always the same number of rows in each of the tables. The rankX.rank values, however, will be different (imagine users ranking how funny a series of images is: the images all have the same ids, but different users might rank them differently).
What I was thinking was that when a new user was added and a new rankX table created I would do the following:
Have rankX.id reference items.id as a foreign key (with ON DELETE CASCADE)
Copy any items.id values that already exist
Auto-generate a trigger function that mirrors INSERTs on items into the rankX table, roughly as sketched below
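In rough terms (PL/pgSQL; types and names are just examples, items.id assumed to be an integer), each generated piece would look something like this:

-- Hypothetical auto-generated mirror for one rank table.
CREATE TABLE rank1 (
    id   integer PRIMARY KEY REFERENCES items (id) ON DELETE CASCADE,
    rank integer
);

-- Copy the ids that already exist.
INSERT INTO rank1 (id) SELECT id FROM items;

-- Mirror future INSERTs on items.
CREATE FUNCTION mirror_items_to_rank1() RETURNS trigger AS $$
BEGIN
    INSERT INTO rank1 (id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_insert_mirror_rank1
AFTER INSERT ON items
FOR EACH ROW EXECUTE PROCEDURE mirror_items_to_rank1();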
This seems cumbersome and wasteful of space since all of the xxxx.id columns are identical and I will end up with hundreds or thousands of trigger functions. As someone new to relational databases I was hoping there was an easier way to achieve this.
So, I have a few questions:
Is there a more efficient way to define my tables such that all of this copying isn't necessary?
If this is the best way, can you give an example of how you would set up the triggers (and associated functions)?
Do I need to worry about running out of space on the server as I create (potentially many) sets of triggers of this type?

How to handle many to many in DynamoDB

I am new to NoSQL and DynamoDB; I come from an RDBMS background.
My tables are being moved from MySQL to DynamoDB. I have these tables:
customer (columns: cid [PK], name, contact)
Hardware (columns: hid[PK], name, type )
Rent (columns: rid [PK], cid, hid, time) => this is the association of a customer and a hardware item.
One customer can have many hardware items, and one hardware item can be shared among many customers.
Requirements: it should be possible to retrieve separate lists of customers and hardware items.
Rent details: which customer borrowed which hardware item.
I referred to this: a secondary index table. That approach is about keeping all the columns in one table.
I thought to have 2 DynamoDb tables:
Customer - this has all the attributes (similar to columns) AND a set of hardware item hash keys. (My issue then is that when the customer table is queried to retrieve only customers, all the hardware keys are loaded as well.)
Any guidance on the table structure, please? How do I save, load, and even update?
Any Java samples, please? (I couldn't find any useful resource similar to my scenario.)
Have a look at DynamoDB's adjacency list design pattern:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html
In your case, based on the adjacency list design pattern, your schema can be designed as follows.
The prefix of the partition key and the sort key indicates the type of record:
If the record is a customer, both the partition key and the sort key have the prefix 'customer-'. Likewise, a hardware record has the prefix 'hardware-' on both keys.
If the record says that a customer rents a hardware item, the partition key's prefix is 'customer-' and the sort key's prefix is 'hardware-'.
base table
+------------+------------+-------------+
|PK |SK |Attributes |
|------------|------------|-------------|
|customer-cid|customer-cid|name, contact|
|hardware-hid|hardware-hid|name, type |
|customer-cid|hardware-hid|time |
+------------+------------+-------------+
Global Secondary Index Table
+------------+------------+----------+
|GSI-1-PK |GSI-1-SK |Attributes|
|------------|------------|----------|
|hardware-hid|customer-cid|time |
+------------+------------+----------+
Customers and hardware items are stored in the same table. A customer can refer to its hardware using:
SELECT * FROM base_table WHERE PK=customer-123 AND SK.startsWith('hardware-')
If you want to go from a hardware item back to its customers, use the GSI table:
SELECT * FROM GSI_table WHERE PK=hardware-333 AND SK.startsWith('customer-')
Note: the SQL above is just pseudo-code, to give you the idea.
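For what it's worth, roughly the same queries can be written in DynamoDB's PartiQL (the table name base_table and index name GSI-1 are assumptions here):

-- All hardware rented by one customer (query against the base table).
SELECT * FROM "base_table"
WHERE "PK" = 'customer-123' AND begins_with("SK", 'hardware-');

-- All customers renting one hardware item (query against the GSI).
SELECT * FROM "base_table"."GSI-1"
WHERE "GSI-1-PK" = 'hardware-333' AND begins_with("GSI-1-SK", 'customer-');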
Take a look at this answer, as it covers many of the basics which are relevant to you.
DynamoDB does not support foreign keys as such. Each table is independent and there are no special tools for keeping two tables synchronised.
You would probably have an attribute in your customers table called hardwares. The attribute would be a list of hardware ids the customer has. If you wanted to see all hardware items belonging to a customer you would:
Perform GetItem on the customer id. Or use Query depending on how you are looking the customer up.
For each hardware id in the customer's hardware attribute, perform a GetItem on the Hardware table.
With DynamoDB you generally end up doing more in the client application than you would with an RDBMS solution. The benefits are that it's fast and simple, but you will probably find you move a lot of work from the database server to your application servers.