Looking to simplify code and execute most effectively
I need to write SQL for reporting that shows week-to-date, month-to-date, and year-to-date levels of the same data. Which tactic would execute faster and be simpler: writing sub-queries for each of the time periods, or pulling the lowest level of detail and using the PARTITION OVER functionality within CASE statements?
Using PostgreSQL 8.0 and AWS tables.
I am leaning towards the sub-queries tactic.
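For comparison, here is a minimal sketch of the second tactic: one pass over the detail rows with CASE expressions inside the aggregates. It runs against an in-memory SQLite database purely so that it is self-contained; the sales table, its columns, the sample rows and the fixed as-of date are all made up for illustration, and nothing here has been benchmarked against the sub-query approach.

```python
import sqlite3
from datetime import date, timedelta

# Fixed "today" so the example is reproducible (2024-03-14 is a Thursday).
as_of = date(2024, 3, 14)
week_start = as_of - timedelta(days=as_of.weekday())  # Monday of that week
month_start = as_of.replace(day=1)
year_start = as_of.replace(month=1, day=1)

conn = sqlite3.connect(":memory:")
# Hypothetical detail table; dates stored as ISO strings so they compare correctly.
conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-12-31", 999),  # previous year, filtered out
        ("2024-01-05", 100),
        ("2024-02-20", 50),
        ("2024-03-02", 30),
        ("2024-03-11", 20),
        ("2024-03-13", 10),
    ],
)

# One scan of the year's rows; each period is a conditional sum.
wtd, mtd, ytd = conn.execute(
    """
    SELECT
        SUM(CASE WHEN sale_date >= :week_start THEN amount ELSE 0 END) AS wtd,
        SUM(CASE WHEN sale_date >= :month_start THEN amount ELSE 0 END) AS mtd,
        SUM(amount) AS ytd
    FROM sales
    WHERE sale_date >= :year_start AND sale_date <= :as_of
    """,
    {
        "week_start": week_start.isoformat(),
        "month_start": month_start.isoformat(),
        "year_start": year_start.isoformat(),
        "as_of": as_of.isoformat(),
    },
).fetchone()

print(wtd, mtd, ytd)
```

The SELECT itself is plain SQL, so the same shape should carry over to PostgreSQL or Redshift with native date columns. Whether it beats three sub-queries depends on the data and the engine, so it is worth checking both with EXPLAIN.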
Related
We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins which usually takes about 10-15 minutes to complete. We then show this aggregated data on Tableau to generate our reports.
Lately, we have been getting many change requests, such as adding a new dimension (which usually requires a join with a new table) or getting data on some more specific filter. To accommodate these requests we have to change our queries every time for each of our subprocesses.
I read a little about OLAP. I just want to know whether it would be better for our use case, or whether there is a better way to design our system to handle such ad hoc requests without requiring a developer to change things every time.
Thanks for the suggestions in advance.
It would work, or rather it should work. Efficiency is the key here. There are a few things you need to monitor strictly to make sure your system (Redshift + Tableau) remains up and running.
Prefer Extract over Live Connection (in Tableau)
A live connection queries the system every time someone changes a filter or refreshes the report. Since you said the dataset is large and the queries are complex, prefer creating an extract. This makes sure data is available upfront whenever someone accesses your dashboard. Do not forget to schedule the extract refresh, otherwise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query a large dataset. Make sure you write efficient queries. It's always better to first reduce each input to a small dataset and then join, rather than bringing everything into memory and then joining or using a WHERE clause to filter the result.
A query like (select foo from table1 where ... ) a left join (select bar from table2 where ... ) b can be the key at times, since you take out only the small, relevant data and then join.
Do not query infinite data.
Since this is analytical and not transactional data, have an upper bound on the data that Tableau will refresh. Historical data has an importance, but not from the time of inception of your product. Analysing the data for the past 3, 6 or 9 months can be the key rather than querying the universal dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table that has just one (or two) entries per user per day and introduce a column, count, which tells you the number of times the event was triggered. By doing this, you'll be querying a significantly smaller dataset that is logically equivalent to what you were querying earlier.
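A minimal sketch of such an aggregate table, using an in-memory SQLite database as a stand-in for Redshift so the snippet runs on its own. The table and column names (raw_events, user_events_daily, event_count) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table: one row per triggered event.
conn.execute("CREATE TABLE raw_events (user_id TEXT, event_day TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [
        ("u1", "2024-01-01"), ("u1", "2024-01-01"), ("u1", "2024-01-02"),
        ("u2", "2024-01-01"), ("u2", "2024-01-01"), ("u2", "2024-01-01"),
    ],
)

# Aggregate table: one row per user per day, plus a count column.
conn.execute(
    """
    CREATE TABLE user_events_daily AS
    SELECT user_id, event_day, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY user_id, event_day
    """
)

aggregate_rows = conn.execute(
    "SELECT user_id, event_day, event_count FROM user_events_daily "
    "ORDER BY user_id, event_day"
).fetchall()

# Summing the count column gives back the raw row count.
total_events = conn.execute(
    "SELECT SUM(event_count) FROM user_events_daily"
).fetchone()[0]

print(aggregate_rows)
print(total_events)
```

Six raw rows collapse into three aggregate rows here, and the summed count still matches the raw total, which is the sense in which the smaller table is logically equivalent for counting queries.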
As mentioned by Mr Prashant Momaya,
"While dealing with extracts, your storage requires (size)^2 of space if your dashboard refers to data of size (size)"
Be very cautious with whatever design you implement, and do not forget to consider the most important factor: scalability.
This is a typical problem and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters you can declare it as JSON and write a generator that will produce the SQL. Example with pageviews:
{
"metric": "unique pageviews"
,"definition": "count(distinct cookie_id)"
,"source": "public.pageviews"
,"tscol": "timestamp"
,"dimensions": [
["day"]
,["day","country"]
]
}
can be translated relatively easily into 2 scripts. This:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
The complexity of a generator required to produce such SQL from such a config is quite low, but it increases exponentially as you need to add new joins etc. It's much better to keep your dimensions in encoded form and just use a single wide table as the aggregation source, or to produce views for every join you might need and use them as sources.
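To give an idea of how small such a generator can be, here is a minimal sketch that turns a config like the pageviews one into the two scripts shown. The helper names (table_name, generate_sql), the metrics_daily default schema argument and the naming rule for the target tables are assumptions made for this example, not the generator actually used.

```python
import json

# The pageviews config, written as valid JSON.
CONFIG = json.loads("""
{
  "metric": "unique pageviews",
  "definition": "count(distinct cookie_id)",
  "source": "public.pageviews",
  "tscol": "timestamp",
  "dimensions": [["day"], ["day", "country"]]
}
""")


def table_name(source, dims, schema="metrics_daily"):
    """Assumed naming rule: <schema>.<source table>[_by_<non-day dimensions>]."""
    base = source.split(".")[-1]
    extra = [d for d in dims if d != "day"]
    suffix = "_by_" + "_".join(extra) if extra else ""
    return f"{schema}.{base}{suffix}"


def generate_sql(config, dims):
    """Build the drop/create script for one set of dimensions."""
    target = table_name(config["source"], dims)
    alias = config["metric"].replace(" ", "_")
    tscol = config["tscol"]
    columns = []
    for dim in dims:
        if dim == "day":
            # The day dimension is derived from the timestamp column.
            columns.append(f"date_trunc('day',\"{tscol}\") as date")
        else:
            columns.append(dim)
    columns.append(f"{config['definition']} as \"{alias}\"")
    # Group by the positions of the dimension columns: 1, 2, ...
    group_by = ",".join(str(i) for i in range(1, len(dims) + 1))
    select_list = "\n,".join(columns)
    return (
        f"drop table {target};\n"
        f"create table {target} as\n"
        f"select\n{select_list}\n"
        f"from {config['source']}\n"
        f"group by {group_by};"
    )


scripts = [generate_sql(CONFIG, dims) for dims in CONFIG["dimensions"]]
for script in scripts:
    print(script, end="\n\n")
```

Note that the sketch interpolates config values straight into SQL text, so it only makes sense for trusted, developer-maintained configs. It also has no notion of joins, which is exactly where the complexity starts to grow.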
From PostgreSQL 9.6 Release Notes:
Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized.
My question is: if a CTE (WITH clause) contains only read operations, but its result is used to feed a write operation such as an insert or update, is parallelizing its sequential scans also disallowed?
I mean, since a CTE is much like a temporary table that exists only for the currently executing query, can I assume that its inner query can take advantage of the brand new parallel seq scan of PostgreSQL 9.6? Or is it treated like a subquery, so that it cannot perform a parallel scan?
For example, consider this query:
WITH foobarbaz AS (
SELECT foo FROM bar
WHERE some_expensive_function(baz)
)
DELETE FROM bar
USING foobarbaz
WHERE bar.foo = foobarbaz.foo
;
Is that foobarbaz calculation expected to be parallelizable, or is it disallowed because of the DELETE statement?
If it isn't allowed, I thought I could replace the CTE with a CREATE TEMPORARY TABLE statement. But I think I would run into the same issue, as CREATE TABLE is a write operation. Am I wrong?
Lastly, one more thing I could try is to perform it as a pure read operation and use its result as input for the insert and/or update operations. Outside of a transaction it should work. But the question is: if the read operation and the insert/update sit between BEGIN and COMMIT statements, will it be disallowed anyway? I understand they are two completely different operations, but they run in the same transaction in Postgres.
To be clear, my concern is that I have an awful mass of hard-to-read and hard-to-redesign SQL queries that involve multiple sequential scans with low-performance function calls and that perform complex changes over two tables. The whole process runs in a single transaction because, otherwise, the mess in case of failure would be totally unrecoverable.
My hope is to be able to parallelize some sequential scans to take advantage of the machine's 8 CPU cores and complete the process sooner.
Please don't answer that I need to fully redesign that mess: I know, and I'm working on it. But it is a large project and we need to keep working in the meantime.
Anyway, any suggestion would be appreciated.
EDIT:
I add a brief report of what I could discover up to now:
As @a_horse_with_no_name says in his comment (thanks), the CTE and the rest of the query form a single DML statement and, if it contains a write operation, even outside of the CTE, then the CTE cannot be parallelized (I also tested it).
Also I found this wiki page with more concise information about parallel scans than what I found in the release notes link.
An interesting point I could check thanks to that wiki page is that I need to declare the involved functions as parallel safe. I did so and it worked (in a test without writes).
Another interesting point is what @a_horse_with_no_name says in his second comment: using dblink to perform a pure read-only query. But, investigating a bit, I saw that postgres_fdw, which is explicitly mentioned in the wiki as not supporting parallel scans, provides roughly the same functionality using a more modern and standards-compliant infrastructure.
On the other hand, even if it worked, I would end up getting data from outside the transaction, which would be acceptable for me in some cases but, I think, is not a good idea as a general solution.
Finally, I checked that it is possible to perform a parallel scan in a read-only query inside a transaction, even if the transaction later performs write operations (no exception is raised and I could commit).
...In summary, I think that my best bet (if not the only one) is to refactor the script so that it reads the data into memory first and then performs the write operations in the same transaction.
It will increase I/O overhead but, given the latencies I am dealing with, it should still be an improvement.
I have some complex queries that produce the same result. The only difference is the execution order. For example, one query performs the selection before the join, while the other performs the join first and then the selection. However, when I read the explanation (on the Explain tab, using pgAdmin III), both queries have the same diagram.
Why?
I'm not a pro at explaining this with all the correct terminology, but essentially the preprocessing attempts to find the most efficient way to execute the statement. It does this by breaking the statement down into simpler sub-statements. Just because you write it one way doesn't mean that is the order in which the preprocessing will execute the plan. It is a bit like precedence in arithmetic (brackets, multiply, divide, etc.).
Certain operations will influence the statement order of execution enabling you to "tune" your queries to make them more efficient. http://www.postgresql.org/docs/current/interactive/performance-tips.html
I have this process that has to make a series of queries, using pl/pgsql:
--process:
SELECT function1();
SELECT function2();
SELECT function3();
SELECT function4();
To be able to execute everything in one call, I created a process function as such:
CREATE OR REPLACE FUNCTION process()
RETURNS text AS
$BODY$
BEGIN
PERFORM function1();
PERFORM function2();
PERFORM function3();
PERFORM function4();
RETURN 'process ended';
END;
$BODY$
LANGUAGE plpgsql;
The problem is, when I sum the time that each function takes by itself, the total is 200 seconds, while the time that the function process() takes is more than one hour!
Maybe it's a memory issue, but I don't know which setting in postgresql.conf I should change.
The DB is running on PostgreSQL 9.4, on Debian 8.
You commented that the 4 functions have to run consecutively. So it's safe to assume that each function works with data from tables that have been modified by the previous function. That's my prime suspect.
Any Postgres function runs inside the transaction of the outer context. So all functions share the same transaction context if packed into another function. Each can see effects on data from previous functions, obviously. (Even though effects are still invisible to other concurrent transactions.) But statistics are not updated immediately.
Query plans are based on statistics on the involved objects. PL/pgSQL does not plan statements until they are actually executed, which would work in your favor. Per documentation:
As each expression and SQL command is first executed in the function,
the PL/pgSQL interpreter parses and analyzes the command to create a
prepared statement, using the SPI manager's SPI_prepare function.
PL/pgSQL can cache query plans, but only within the same session and (in pg 9.2+ at least) only after a couple of executions have shown the same query plan to work best repeatedly. If you suspect this going wrong for you, you can work around it with dynamic SQL which forces a new plan every time:
EXECUTE 'SELECT function1()';
However, the most likely candidate I see is invalidated statistics that lead to inferior query plans. SELECT / PERFORM statements (same thing) inside the function are run in quick succession, so there is no chance for autovacuum to kick in and update statistics between one function and the next. If one function substantially alters data in a table the next function is working with, the next function might base its query plan on outdated information. Typical example: a table with a few rows is filled with many thousands of rows, but the next plan still thinks a sequential scan is fastest for the "small" table. You state:
when I sum the time that each function takes by itself, the total is
200 seconds, while the time that the function process() takes is more
than one hour!
What exactly does "by itself" mean? Did you run them in a single transaction or in individual transactions? Maybe even with some time in between? That would allow autovacuum to update statistics (it's typically rather quick) and possibly lead to completely different query plans based on the changed statistic.
You can inspect query plans inside plpgsql functions with auto-explain:
Postgres query plan of a UDF invocation written in pgpsql
If you can identify such an issue, you can force ANALYZE in between statements. While at it, for just a couple of SELECT / PERFORM statements you might as well use a simpler SQL function and avoid plan caching altogether (but see below!):
CREATE OR REPLACE FUNCTION process()
RETURNS text
LANGUAGE sql AS
$func$
SELECT function1();
ANALYZE some_substantially_affected_table;
SELECT function2();
SELECT function3();
ANALYZE some_other_table;
SELECT function4();
SELECT 'process ended'; -- only last result is returned
$func$;
Also, as long as we don't see the actual code of your called functions, there can be any number of other hidden effects.
Example: you could SET LOCAL ... some configuration parameter to improve the performance of your function1(). If called in separate transactions, that won't affect the rest. The effect only lasts until the end of the transaction. But if called in a single transaction, it affects the rest, too ...
Basics:
Difference between language sql and language plpgsql in PostgreSQL functions
PostgreSQL Stored Procedure Performance
Plus: transactions accumulate locks, which binds an increasing amount of resources and may cause increasing friction with concurrent processes. All locks are released at the end of a transaction. It's better to run big functions in separate transactions if at all possible, not wrapped in a single function (and thus a single transaction). That last item is related to what @klin and IMSoP already covered.
Warning for future readers (2015-05-30).
The technique described in the question is one of the smartest ways to effectively block the server.
In some corporations the use of this technology can meet with the reward in the form of immediate termination of the employment contract.
Attempts to improve this method are useless. It is simple, beautiful and sufficiently effective.
In an RDBMS, transaction support is very expensive. When executing a transaction, the server must create and store information on all changes made to the database, so that it can make these changes visible to the environment (other concurrent processes) on successful completion and, in case of failure, restore the state before the transaction as quickly as possible. Therefore the natural principle affecting server performance is to include in one transaction a minimum number of database operations, i.e. only as many as necessary.
A Postgres function is executed in one transaction. Placing in it many operations that could be run independently is a serious violation of the above rule.
The answer is simple: just do not do it. A function execution is not a mere execution of a script.
In the procedural languages used to write applications there are many other possibilities to simplify the code by using functions or scripts. There is also the possibility to run scripts with shell.
The use of a Postgres function for this purpose would make sense if it were possible to use transactions within the function. At present, no such possibility exists, although discussions on this issue already have a long history (you can read about it e.g. in the Postgres mailing lists).
I'm having a scaling issue with an application that uses a PostgreSQL 9 backend. I have one table whose size is about 40 million records and growing, and the conditional queries against it have slowed down dramatically.
To help figure out what's going wrong, I've taken a development snapshot of the database and dumped the queries with their execution times into the log.
Now for the confusing part, and the gist of the question ....
The run times for my queries in the log are vastly different (an order of magnitude or more) from what I get when I run the 'exact' same query in DbVisualizer to get the explain plan.
I say 'exact', but really the difference is that the application uses a prepared statement to which I bind values at runtime, while the queries I run in DbVisualizer have those values in place already. The values themselves are exactly as I pulled them from the log.
Could the use of prepared statements make that big of a difference?
The answer is YES. Prepared statements cut both ways.
On the one hand, the query does not have to be re-planned for every execution, saving some overhead. This can make a difference or be hardly noticeable, depending on the complexity of the query.
On the other hand, with uneven data distribution, a one-size-fits-all query plan may be a bad choice. Called with particular values another query plan could be (much) better suited.
Running the query with parameter values in place can lead to a different query plan. More planning overhead, possibly a (much) better query plan.
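The trade-off can be illustrated with a deliberately crude toy model. This is not how the Postgres planner computes costs: the skewed data, the 5% threshold and the two-way choice between plan types are all assumptions picked only to show why a single plan chosen without knowing the parameter can suit one value and be poor for another.

```python
from collections import Counter

# Synthetic, heavily skewed column: one value dominates.
rows = ["common"] * 9_900 + ["rare"] * 100
counts = Counter(rows)
total = len(rows)

# Assumed cutoff: below this fraction of matching rows, prefer an index scan.
INDEX_THRESHOLD = 0.05


def choose_plan(selectivity):
    """Toy planner: pick a scan type from the estimated fraction of matching rows."""
    return "index scan" if selectivity < INDEX_THRESHOLD else "seq scan"


# One-size-fits-all plan: the parameter value is unknown, so assume every
# distinct value is equally likely (1 / number of distinct values).
generic_selectivity = 1 / len(counts)
generic_plan = choose_plan(generic_selectivity)

# Plans chosen with the actual value in place use its real frequency.
custom_plans = {value: choose_plan(n / total) for value, n in counts.items()}

print("generic:", generic_plan)
for value, plan in custom_plans.items():
    print(value, "->", plan, "(same as generic)" if plan == generic_plan else "(differs)")
```

In this toy setup the generic choice matches what is best for the common value but not for the rare one, which is the kind of mismatch that could explain a large gap between the logged prepared-statement timings and the timings with literal values.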
Also consider unnamed prepared statements like @peufeu provided. Those re-plan the query considering parameters every time, and you still have safe parameter handling.
Similar considerations apply to queries inside PL/pgSQL functions, where queries can be treated as prepared statements internally - unless executed dynamically with EXECUTE. I quote the manual on Executing Dynamic Commands:
The important difference is that EXECUTE will re-plan the command on
each execution, generating a plan that is specific to the current
parameter values; whereas PL/pgSQL may otherwise create a generic plan
and cache it for re-use. In situations where the best plan depends
strongly on the parameter values, it can be helpful to use EXECUTE to
positively ensure that a generic plan is not selected.
Apart from that, general guidelines for performance optimization apply.
Erwin nails it, but let me add that the extended query protocol allows you to use more flavors of prepared statements. Besides avoiding re-parsing and re-planning, one big advantage of prepared statements is that parameter values are sent separately, which avoids escaping and parsing overhead, not to mention the opportunity for SQL injection and bugs if you don't use an API that handles parameters in a way that makes it impossible to forget to escape them.
http://www.postgresql.org/docs/9.1/static/protocol-flow.html
Query planning for named prepared-statement objects occurs when the
Parse message is processed. If a query will be repeatedly executed
with different parameters, it might be beneficial to send a single
Parse message containing a parameterized query, followed by multiple
Bind and Execute messages. This will avoid replanning the query on
each execution.
The unnamed prepared statement is likewise planned during Parse
processing if the Parse message defines no parameters. But if there
are parameters, query planning occurs every time Bind parameters are
supplied. This allows the planner to make use of the actual values of
the parameters provided by each Bind message, rather than use generic
estimates.
So, if your DB interface supports it, you can use unnamed prepared statements. It's a bit of a middle ground between a query and a usual prepared statement.
If you use PHP with PDO, please note that PDO's prepared statement implementation is rather useless for Postgres: it uses named prepared statements but re-prepares every time you call prepare(), so no plan caching takes place. So you get the worst of both: many round trips and a plan made without parameters. I've seen it be 1000x slower than pg_query() and pg_query_params() on specific queries where the Postgres optimizer really needs to know the parameters to produce the optimal plan. pg_query uses raw queries; pg_query_params uses unnamed prepared statements. Usually one is faster than the other, depending on the size of the parameter data.