Getting Over to Someone Why Views Can't Be Used as a Tables - postgresql

For some reason I'm having a hard time getting over to some people that using a view in Postgres as you would use a table, is a bad idea.
As some background, there are a number of tables containing completely static data that is updated every few months via a batch import into different tables by date - table_201603 or table_201607. A view has then been created called 'table' which clients then use which is just a 'SELECT * FROM' of the table. When an updated batch of data is put into a new table the view is then updated to point at the new table. This means an in-place rename of the table does not need to take place that might mean downtime. This is in a version of Postgres before 9.3 where materialized views came in, just to clarify. These tables generally have about 100 million rows in them.
This is understandably leading to some confusing results when people are querying these views with very inconsistent query times. Sometimes queries are taking seconds, other times 20 or 30 milliseconds.
Additional: This is geospatial data, so they're doing geospatial queries on a view.
I know what many of the pitfalls here are - views are created on-the-fly like a sub-query, you're very much at the whim of the query planner as to what predicates get brought down and how long results are cached as results aren't physically stored as tables - but can anyone see anything else and suggest a better way of doing this? I can imagine this would be a reasonably common scenario so it might help others.
Thanks,

In general, this reminds me a use case for synonym. However, there are no synonyms in Postgres and they recommend using Views and or separation by schema
https://www.postgresql.org/message-id/kon2r2$mo6$1#ger.gmane.org

Related

How to cache duplicate queries in PostgreSQL

I have some extensive queries (each of them lasts around 90 seconds). The good news is that my queries are not changed a lot. As a result, most of my queries are duplicate. I am looking for a way to cache the query result in PostgreSQL. I have searched for the answer but I could not find it (Some answers are outdated and some of them are not clear).
I use an application which is connected to the Postgres directly.
The query is a simple SQL query which return thousands of data instance.
SELECT * FROM Foo WHERE field_a<100
Is there any way to cache a query result for at least a couple of hours?
It is possible to cache expensive queries in postgres using a technique called a "materialized view", however given how simple your query is I'm not sure that this will give you much gain.
You may be better caching this information directly in your application, in memory. Or if possible caching a further processed set of data, rather than the raw rows.
ref:
https://www.postgresql.org/docs/current/rules-materializedviews.html
Depending on what your application looks like, a TEMPORARY TABLE might work for you. It is only visible to the connection that created it and it is automatically dropped when the database session is closed.
CREATE TEMPORARY TABLE tempfoo AS
SELECT * FROM Foo WHERE field_a<100;
The downside to this approach is that you get a snapshot of Foo when you create tempfoo. You will not see any new data that gets added to Foo when you look at tempfoo.
Another approach. If you have access to the database, you may be able to significantly speed up your queries by adding and index on on field

postgres many tables vs one huge table

I am using postgresql db.
my application manages many objects of the same type.
for each object my application performs intense db writing - each object has a line inserted to db at least once every 30 seconds. I also need to retrieve the data by object id.
my question is how it's best to design the database? use one huge table for all the objects (slower inserts) or use table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, so called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switeched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.

Cron job + new table vs materialized views for webapp analytics?

Hi
I am in the process of adding analytics to my SaaS app, and I'd love to hear other people's experiences doing this.
Current I see two different approches:
Do most of the data handling at the DB level, building and aggregating data into materialized views for performance boost. This way the data will stay normalized.
Have different cronjobs/processes that will run at different intervals (10 min, 1 hour etc.) that will query the database and insert aggregate results into a new table. In this case, the metrics/analytics are denormalized.
Which approach makes the most sense, maybe something completely different?
On really big data, the cronjob or ETL is the only option. You read the data once, aggregate it and never go back. Querying aggregated data is then relatively cheap.
Views will go through tables. If you use "explain" for a view-based query, you might see the data is still being read from tables, possibly using indexes (if corresponding indexes exist). Querying terabytes of data this way is not viable.
The only problem with the cronjob/ETL approach is that it's PITA to maintain. If you find a bug on production environment - you are screwed. You might spend days and weeks fixing and recalculating aggregations. Simply said: you have to get it right the first time :)

Postgres partitioning?

My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 millions rows, Postgres should be still able to run select statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even by-pass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
To borrow Linus Torvald's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").

Performance improvement for fetching records from a Table of 10 million records in Postgres DB

I have a analytic table that contains 10 million records and for producing charts i have to fetch records from analytic table. several other tables are also joined to this table and data is fetched currently But it takes around 10 minutes even though i have indexed the joined column and i have used Materialized views in Postgres.But still performance is very low it takes 5 mins for executing the select query from Materialized view.
Please suggest me some technique to get the result within 5sec. I dont want to change the DB storage structure as so much of code changes has to be done to support it. I would like to know if there is some in built methods for query speed improvement.
Thanks in Advance
In general you can take care of this issue by creating a better data structure(Most engines do this to an extent for you with keys).
But if you were to create a sorting column of sorts. and create a tree like structure then you'd be left to a search rate of (N(log[N]) rather then what you may be facing right now. This will ensure you always have a huge speed up in your searches.
This is in regards to binary tree's, Red-Black trees and so on.
Another implementation for a speedup may be to make use of something allong the lines of REDIS, ie - a nice database caching layer.
For analytical reasons in the past I have also chosen to make use of technologies related to hadoop. Though this may be a larger migration in your case at this point.