Postgres db design Normalize tables or Use Array Columns - postgresql

Newbie trying to figure out the best way to design a Postgres db for the following use case scenario.
There is an Account table for the business customers and there is a contacts table with a column relationship.
account.pk_id, ….
contacts.pk_id, contacts.fk_accountid …
Thousands of different businesses in the Accounts table will be storing millions of contacts each in the Contacts table.
Each contact record will over time belong to between 1 and 100 different categories, lists and products.
If I use a classic sql master/child relationship I potentially end up with millions and millions of rows in tables such as contacts_categories, contacts_lists and contacts_products which would reference from Categories, Lists & Products tables.
Alternatively, I could store the related keys ( uuid’s) for categories, lists and products in 3 character varying arrays[] columns in the contact record row. This would eliminate the need for the contacts_categories, contacts_lists and contacts_products tables that would be quite large.
With tools like Select unnest, array_append() and the array index options it seems like a smart solution but am curious to know if it is better to stick to normalized relations and more tables and row counts for performance and / or storage memory / cost.
Anybody tried this before ?

Too many people have tried that, and it is a bad idea. Many of your queries, particularly joins, will become complicated and slow. Besides, you won't be able to have foreign key constraints to guarantee data integrity.
Relational databases are good at coping with millions of rows in a table. Keep your schema normalized.

Related

Design question for a table with too many joins OR polymorphic relations in Postgres 11.7

I've been given a table that I'm not sure how to design. I'm hoping for some design suggestions, or pointers in the right direction. The table is called edge and is meant to store some event traces, and IDs that link out to a host of possible lookup tables. Leaving out everything but IDs, here's what the table contains, all UUIDs:
ID
InvID
OrgID
FacilityID
FromAssemblyID
FromAssociatedTo
FromAssociatedToID
FromClinicID
FromFacilityDepartmentID
FromFacilityID
FromFacilityLocationID
FromScanAtFacilityID
FromScanID
FromSCaseID
FromSterilizerLoadID
FromWasherLoadID
FromWebUserID
ToAssemblyID
ToAssociatedTo
ToAssociatedToID
ToClinicID
ToFacilityDepartmentID
ToFacilityID
ToFacilityLocationID
ToNodeDTS
ToScanAtFacilityID
ToScanID
ToSCaseID
ToSterilizerLoadID
ToUserName
ToWasherLoadID
ToWebUserID
That's an overwhelming number of IDs to possibly join on. I remember reading that the Postgres planner kind of gives up when you've got a dozen+ joins. The idea being that there are so many permutations to explore, that the planning time could quickly overwhelm the query time. If you boil it down, the "from" and "to" links are only ever going to have one key value across all of those fields. So, implemented as a polymorphic/promiscuous relations, something like this:
ID
InvID
OrgID
FacilityID
FromID
FromType
ToID
ToType
ToWebUserID
This table is going to be ginormous, so speed is/will be a consideration.
I encouraged the author not to use a polymorphic design, although the appeal is obvious. (I like Karwin's SQL Antipatterns book.) But now, confronted with nearly three dozen IDs, I'm a bit stumped.
Is there a common solution to this kind of problem? Namely, where you've got a central table like this with connections to a wide variety of possible tables? I don't have a Data Warehousing background, but this looks somewhat like that. (The author of this table has read Kimball's books, but not done any Data Warehouse implementations either.)
Important: We're using JOIN to do lookups on related values that might change, we're not using it to change the size of the result set. Just pretend it would always be LEFT JOIN.
With that in mind, what I've thought of is to skip joining on the From and To IDs, and instead use custom function calls to look up required values from the related tables. like (pseudo-code)
GetUserName(uuid) : citext
...and os on for other values of interest in this and other tables...
The function would return '' when the UUID is 0000etc.
I appreciate that this isn't the crispest question in the history of SO, and I what I'm hoping for pointers in a fruitful direction.
This smacks of “premature optimization” (which is a source of evil) based on something that you “remember reading”, so maybe some enlightenment about join optimization will help.
One rule of thumb that I follow in questions like this is to model things so that your queries become simple and natural. Experience shows that that often leads to good performance.
I assume that the table you show is the fact table of a star schema, and the foreign keys point to the many dimension tables, so that your query will look like
SELECT ...
FROM fact
JOIN dim1 ON fact.dim1_id = dim1.id
JOIN dim2 ON fact.dim3_id = dim2.id
JOIN dim3 ON fact.dim3_id = dim3.id
...
WHERE dim1.col1 = ...
AND dim2.col2 BETWEEN ... AND ...
AND dim3.col3 < ...
...
Now PostgreSQL will by default only consider all join permutations of the first eight tables (join_collapse_limit), and the rest of the tables are just joined in the order in which they appear in the query.
Moreover, if the number of tables reaches the threshold of 12 (geqo_threshold), the genetic query optimizer takes over, a component that simulates evolution by mutation and survival of the fittest with randomly chosen execution plans (really!) and consequently doesn't always come up with the same execution plan for the same query.
So my advice would be to write the queries in a way that the first seven dimension tables are the ones with the biggest chance of reducing the number of result rows most significantly (based on the WHERE conditions). You can also increase join_collapse_limit, because if your queries take a long time to run anyway, you can easily afford the planner to spend more time thinking about the best plan.
Then you'd set geqo = off to disable the genetic query optimizer.
If you design your queries according to these principles, you should be able to get good execution plans without messing up the data model.

Use case for hstore against multiple columns

I'm having some troubles deciding on which approach to use.
I have several entity "types", let's call them A,B and C, who share a certain number of attributes (about 10-15). I created a table called ENTITIES, and a column for each of the common attributes.
A,B,C also have some (mostly)unique attributes (all boolean, can be 10 to 30 approx).
I'm unsure what is the best approach to follow in modelling the tables:
Create a column in the ENTITIES table for each attribute, meaning that entity types that don't share that attribute will just have a null value.
Use separate tables for the unique attributes of each entity type, which is a bit harder to manage.
Use an hstore column, each entity will store its unique flags in this column.
???
I'm inclined to use 3, but I'd like to know if there's a better solution.
(4) Inheritance
The cleanest style from a database-design point-of-view would probably be inheritance, like #yieldsfalsehood suggested in his comment. Here is an example with more information, code and links:
Select (retrieve) all records from multiple schemas using Postgres
The current implementation of inheritance in Postgres has a number of limitations, though. Among others, you cannot define a common foreign key constraints for all inheriting tables. Read the last chapter about caveats carefully.
(3) hstore, json (pg 9.2+) / jsonb (pg 9.4+)
A good alternative for lots of different or a changing set of attributes, especially since you can even have functional indices on attributes inside the column:
unique index or constraint on hstore key
Index for finding an element in a JSON array
jsonb indexing in Postgres 9.4
EAV type of storage has its own set of advantages and disadvantages. This question on dba.SE provides a very good overview.
(1) One table with lots of columns
It's the simple, kind of brute-force alternative. Judging from your description, you would end up with around 100 columns, most of them boolean and most of them NULL most of the time. Add a column entity_id to mark the type. Enforcing constraints per type is a bit awkward with lots of columns. I wouldn't bother with too many constraints that might not be needed.
The maximum number of columns allowed is 1600. With most of the columns being NULL, this upper limit applies. As long as you keep it down to 100 - 200 columns, I wouldn't worry. NULL storage is very cheap in Postgres (basically 1 bit per column, but it's more complex than that.). That's only like 10 - 20 bytes extra per row. Contrary to what one might assume (!), most probably much smaller on disk than the hstore solution.
While such a table looks monstrous to the human eye, it is no problem for Postgres to handle. RDBMSes specialize in brute force. You might define a set of views (for each type of entity) on top of the base table with just the columns of interest and work with those where applicable. That's like the reverse approach of inheritance. But this way you can have common indexes and foreign keys etc. Not that bad. I might do that.
All that said, the decision is still yours. It all depends on the details of your requirements.
In my line of work, we have rapidly-changing requirements, and we rarely get downtime for proper schema upgrades. Having done both the big-record with lots on nulls and highly normalized (name,value), I've been thinking that it might be nice it have all the common attributes in proper columns, and the different/less common ones in a "hstore" or jsonb bucket for the rest.

Efficient way of insert millions of rows, convert data and deal with it, on PostgreSQL+PostGIS

I have a big collection of data I want to use for user search later.
Currently I have 200 millions resources (~50GB). For each, I have latitude+longitude. The goal is to create spatial index to be able to do spatial queries on it.
So for that, the plan is to use PostgreSQL + PostGIS.
My data are on CSV file. I tried to use custom function to not insert duplicates, but after days of processing I gave up. I found a way to load it fast in the database: with COPY it takes less than 2 hours.
Then, I need to convert latitude+longitude on Geometry format. For that I just need to do:
ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326))
After some checking, I saw that for 200 millions, I have 50 millions points. So, I think the best way is to have a table "TABLE_POINTS" that will store all the points, and a table "TABLE_RESOURCES" that will store resources with point_key.
So I need to fill "TABLE_POINTS" and "TABLE_RESOURCES" from temporary table "TABLE_TEMP" and not keeping duplicates.
For "POINTS" I did:
INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326))
FROM TABLE_RESOURCES
I don't remember how much time it took, but I think it was matter of hours.
Then, to fill "RESOURCES", I tried:
INSERT INTO TABLE_RESOURCES (...,point_key)
SELECT DISTINCT ...,point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326) = point;
but again take days, and there is no way to see how far the query is ...
Also something important, the number of resources will continue to grow up, currently should be like 100K added by day, so storage should be optimized to keep fast access to data.
So if you have any idea for the loading or the optimization of the storage you are welcome.
Look into optimizing postgres first (ie google postgres unlogged, wal and fsync options), second do you really need points to be unique? Maybe just have one table with resources and points combined and not worry about duplicate points as it seems your duplicate lookup maybe whats slow.
For DISTINCT to work efficiently, you'll need a database index on those columns for which you want to eliminate duplicates (e.g. on the latitude/longitude columns, or even on the set of all columns).
So first insert all data into your temp table, then CREATE INDEX (this is usually faster that creating the index beforehand, as maintaining it during insertion is costly), and only afterwards do the INSERT INTO ... SELECT DISTINCT.
An EXPLAIN <your query> can tell you whether the SELECT DISTINCT now uses the index.

Postgres hstore for time series

I am new to postgres and am experimenting with the hstore extension.Looking for some guidance. I need to support basic reporting on timeseries data for various products that we sell. I have a large amount data in the format "Timestamp, Value" for each product. This data is available in a csv fle for each product.
I am thinking of using hstore to store this data in the key value format. Assuming that all the timeseries data for a single product can be stored in a single hstore object. I need to be able to query this data by specific times, say what was the value of a product at a given time? Also need to run simple queries like retrieving the times where the product costed more than $100.
I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:
The hstore column needs to be loaded from thousands of timestamp,value records that exist in a csv. The hstore should be appended whenever we get a new csv.
The table needs to store the productId and corresponding Timeseries data.
Can you please advise if using hstore would be helpful ? If yes then how can I load data from csv as explained above. Also, if there could be any impact on the performance on inserts/updates in the hstore, as data grows please share your experiences.
I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:
CREATE TABLE product_data
(
product TEXT, -- I'm making an assumption about the types of your columns
time TIMESTAMP,
value DOUBLE PRECISION,
PRIMARY KEY (product, time);
);
I would definitely keep hstore and similar options in mind, if and when your data becomes large enough that efficiency is more important and simplicity. But note that all options have an efficiency tradeoff.
Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?
What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value), if the product has many distinct timestamps.
Other options
hstore is most useful if you want to store a table set of arbitrary key-value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in the product's table. The downsides are that keys and values in hstore are text, whereas your keys are timestamps, and your values are numbers of some kind. So there will be a certain reduction in type checking, and a certain increase in type casting cost required. Another possible downside is that some queries on the hstore might not use indexes very efficiently. The above table can use simple btree indexes for range queries (say you want to pull out the values between two dates for a product). But hstore indexes are much more limited; you can use a gist or gin index on an hstore column to find all the rows that feature a certain key.
Another option (which I've played with and use experimentally for some of my databases) is arrays. Basically, each product will have an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value every hour for every day, you could use a table like this:
CREATE TABLE product_data
(
product TEXT,
day DATE,
values DOUBLE PRECISION[], -- An array from 0 to 23.
PRIMARY KEY (product, day);
);
You can construct views and indexes to make querying this table moderate easy. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)
But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A Mysql example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
A PRIMARY INDEX is a special index that is UNIQUE and NOT NULL. Each table can only have one primary index. In fact, each table should have one primary index to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
When you declare a column as primary key automatically generates an index on that column. In your case for using more often select an index is ideal, because they improve time of selection queries and degrade the time of insertion. So you can create the indexes you think you need without worrying about the performance
Indexes are a very sensitive subject. If you consider using them, you need to be very careful how many you make. The primary key, or id, of each table should have a clustered index. All the rest, it depends on how you plan to use them. I'm very fuzzy in the subject of indexes, and have actually never worked with them, but from a seminar I just watched actually yesterday, you don't want too many indexes - because they can actually slow things down when you don't need to use them.
Let's say you put an index on 5 out of 8 fields on a table. Each index is designated for a particular query somewhere in your software. Well, when 1 query is run, it uses that 1 index, and doesn't need the other 4. So that's unneeded weight on this 1 query. If you need an index, be sure that this is an index which could be useful in many places, not just 1 place.