In IBM DB2 Data Studio, when extracting an explain plan, please confirm whether it depends on the table's data.
Let's say I have a table with 500 records in the testing environment, and the same table has 50,000 records in the production database. If I extract the explain plan for a query that uses this table, will it give me the same cost or a different cost in each environment?
Please let me know if more information is required.
The calculated query cost depends on many things, including table statistics (possibly updated in real time), database configuration parameters, and hardware characteristics. This means that the plan costs, as well as the plans themselves, are unlikely to be the same in different environments.
EDIT: Statistics about the data, such as the number of rows in a table, the number of distinct values in a column, etc., are updated by a special utility, RUNSTATS, and you need to ensure that it runs regularly to reflect changes to the data. If statistics are never collected or not kept up to date, the optimizer will know nothing about the data metrics and will be forced to guess, often resulting in suboptimal performance. In some cases, when the optimizer discovers that the estimated statistics differ from the actual results of a query, it can trigger an automatic statistics update.
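For example, a hedged sketch of refreshing statistics for a single table via the ADMIN_CMD procedure (the schema and table names are hypothetical):

-- collect table, distribution and index statistics for one table
CALL SYSPROC.ADMIN_CMD(
  'RUNSTATS ON TABLE MYSCHEMA.ORDERS WITH DISTRIBUTION AND DETAILED INDEXES ALL'
);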
Related
As the PostgreSQL documentation points out, one way to increase query performance is to increase the statistics target for some columns.
It is known that the default_statistics_target value is not enough for large tables (a few million rows) with irregular value distributions, and must be increased.
It seems practical to create a script that auto-tunes the statistics target for each column. I would like to know what the possible obstacles are in writing such a script, and why I can't find such a script online.
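For reference, the per-column setting such a script would tune looks like this (the table and column names here are hypothetical, and the target value is just an example):

-- raise the statistics detail for one skewed column, then recollect statistics
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;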
That's because it is not that simple. The right target does not primarily depend on the size of the table, but on the data in the table and their distribution, the way in which the data are modified, and most of all on the queries.
So it is pretty much impossible to make that decision from a look at the persisted state, and even with more information it would require quite a bit of artificial intelligence.
One problem with the PostgreSQL planner statistics is that there is no way to compute statistics over all the rows in a table. PostgreSQL always uses a small part of the table to compute statistics (a sample percentage). This has a huge disadvantage on big tables: it can miss important values that make the difference when estimating the cardinality of some operations in the execution plan, which may cause an inappropriate algorithm to be chosen.
Explanation: http://mssqlserver.fr/postgresql-vs-sql-server-mssql-part-3-very-extremely-detailed-comparison/
Especially § 12 – Planner statistics
The reason that PostgreSQL does not offer a "full scan" statistics mode is that it would take too much time to compute. In fact, PostgreSQL is very slow at many maintenance tasks, such as recomputing statistics, as I show here:
http://mssqlserver.fr/postgresql-vs-microsoft-part-1-dba-queries-performances/
In some other RDBMSs it is possible to run UPDATE STATISTICS ... WITH FULLSCAN (Microsoft SQL Server, for example), and this does not take so much time, because MS SQL Server does it in parallel with multiple threads, which PostgreSQL is unable to do.
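For comparison, a hedged T-SQL sketch of that full-scan statistics update (the table name is hypothetical):

-- recompute the table's statistics by reading every row instead of a sample
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;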
Conclusion: PostgreSQL has never been designed for huge tables. Think about using another RDBMS if you want to deal with big tables and need good performance.
Just take a look at the COUNT performance of PostgreSQL compared to MS SQL Server:
http://mssqlserver.fr/postgresql-vs-microsoft-sql-server-comparison-part-2-count-performances/
I have a typical star pattern in my Azure SQL Data Warehouse. Data is first dumped into staging tables via Data Factory, then it calls a master procedure that calls other procedures to transform data into the appropriate format and then clear out the staging tables for that chunk of data.
Should these staging tables have indexes? Should they have statistics? I recently upgraded to Gen 2, but don't have auto create statistics turned on. I worry that statistics will get created but not updated, and so will end up slowing things down more than anything.
For more context, there is a procedure to rebuild indexes and update statistics which is run overnight, once a day. The data load process is run hourly.
Given that these are staging tables, the biggest impacts will come from the following.
Where possible, use a hash distribution. This will give the best performance when you process the table in subsequent steps. While the documentation sometimes suggests ROUND_ROBIN distribution because it is slightly faster for ingestion, the next query on the table will be slower.
Always use statistics. I suggest creating them manually, based on expected usage, for greater predictability in your ELT performance (see the sketch after this list). If you don't create and update statistics you're going to get dreadful performance at some point in the future. If you don't want to undertake the effort of manually managing statistics, then definitely turn on auto-create statistics.
Consider the use of HEAP vs CLUSTERED COLUMNSTORE table structures for staging tables. In general, staging tables are processed on a whole-row basis, and you may find that your performance is better at the staging layer if you use a HEAP. This needs to be tested on your data, as the Gen2 caching that gives much greater performance does not apply to Heap tables.
Definitely create your fact and dimension tables as clustered columnstore indexes. Hash distribute your fact/s, and replicate your dimensions (unless you have billion row dimensions, in which case a hash distribution may be more appropriate).
If you're using CTAS (CREATE TABLE AS SELECT) patterns, your need for non-clustered indexes should be very low. I generally add indexes only when I see a performance problem with a query that I can't solve by any other technique.
Finally, make sure that you're using a reasonable DWU and Resource Class. A general rule of thumb is that you shouldn't be running your ELT at less than DWU500, and LARGERC. If you don't do this, you'll find that you get bad clustered columnstore indexes which will lead to future performance problems.
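A hedged sketch of the manual statistics management mentioned above (schema, table and column names are hypothetical):

-- create statistics on the staging columns used in later joins and filters
CREATE STATISTICS stat_stg_sales_customerid ON stg.Sales (CustomerId);
CREATE STATISTICS stat_stg_sales_orderdate ON stg.Sales (OrderDate);

-- refresh after each hourly load so the optimizer sees current row counts
UPDATE STATISTICS stg.Sales;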
Some input from my side -
Your fact table should be partitioned. In fact, you should have a job that creates the partitions in the fact table automatically.
How big is the fact table? If it is becoming too big, then based on your requirements you can think about archiving old data that is no longer required in the fact table.
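A hedged sketch of such a partitioned fact table in a dedicated SQL pool, with a statement a scheduled job could run to add future partition boundaries (all names, columns and boundary dates are hypothetical):

-- fact table partitioned by month on the date column
CREATE TABLE dbo.FactSales
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( OrderDate RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01') )
)
AS
SELECT CustomerId, OrderDate, Amount
FROM stg.Sales;

-- run ahead of time from a scheduled job, while the tail partition is still
-- empty, to add the next monthly boundary
ALTER TABLE dbo.FactSales SPLIT RANGE ('2024-03-01');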
We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins which usually takes about 10-15 minutes to complete. We then show this aggregated data on Tableau to generate our reports.
Lately, we have been getting many change requests, such as adding a new dimension (which usually requires a join with a new table) or getting data for some more specific filter. To accommodate these requests we have to change our queries every time, for each of our subprocesses.
I have looked into OLAP a little. I just want to know whether it would be better for our use case, or whether there is a better way to design our system to handle such ad-hoc requests without requiring a developer to change things every time.
Thanks for the suggestions in advance.
It would work; rather, it should work. Efficiency is the key here. There are a few things which you need to monitor closely to make sure your system (Redshift + Tableau) remains up and running.
Prefer Extract over Live Connection (in Tableau)
A live connection queries the system every time someone changes a filter or refreshes the report. Since you said the dataset is large and the queries are complex, prefer creating an extract. This makes sure the data is available up front whenever someone accesses your dashboard. Do not forget to schedule the extract refresh, otherwise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query large datasets. Make sure you write efficient queries. It's always better to first get small datasets and join them, rather than bringing everything into memory and then joining or filtering the result with a WHERE clause.
A query like (SELECT foo FROM table1 WHERE ...) a LEFT JOIN (SELECT bar FROM table2 WHERE ...) b can be the key at times: you pull out only the small, relevant data first and then join.
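A hedged sketch of that pattern (table, column and filter names are hypothetical):

-- pre-filter both sides before the join instead of joining the full tables
SELECT o.customer_id,
       COUNT(*)      AS orders,
       AVG(p.amount) AS avg_payment
FROM  (SELECT customer_id, order_id
       FROM   orders
       WHERE  order_date >= DATEADD(month, -3, GETDATE())) o
LEFT JOIN
      (SELECT order_id, amount
       FROM   payments
       WHERE  status = 'settled') p
  ON  p.order_id = o.order_id
GROUP BY o.customer_id;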
Do not query infinite data.
Since this is analytical rather than transactional data, put an upper bound on the data that Tableau will refresh. Historical data is important, but not all the way back to the inception of your product. Analysing the data for the past 3, 6 or 9 months can be the key, rather than querying the entire dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table that has just one (or two) entries per user per day and introduce a count column that tells you how many times the event was triggered. You'll then be querying a much smaller dataset that is logically equivalent to what you were querying before.
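A hedged sketch of such an aggregate (table and column names are hypothetical):

-- one row per user per day per event instead of every raw event
CREATE TABLE user_events_daily AS
SELECT user_id,
       TRUNC(event_ts) AS event_date,
       event_name,
       COUNT(*)        AS event_count
FROM   raw_user_events
GROUP  BY 1, 2, 3;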
As mentioned by Prashant Momaya:
"While dealing with extracts, your storage requires (size)^2 of space if your dashboard refers to data of size size."
Be very cautious with whatever design you implement, and do not forget to consider the most important factor - scalability.
This is a typical problem, and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters, you can declare it as JSON and write a generator that produces the SQL. Example with pageviews:
{
  "metric": "unique pageviews",
  "definition": "count(distinct cookie_id)",
  "source": "public.pageviews",
  "tscol": "timestamp",
  "dimensions": [
    ["day"],
    ["day", "country"]
  ]
}
can be relatively easily translated into two scripts - this:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
The complexity of the generator required to produce such SQL from such a config is quite low, but it increases exponentially as you need to add new joins etc. It's much better to keep your dimensions in encoded form and just use a single wide table as the aggregation source, or produce views for every join you might need and use them as sources.
I am currently testing Redshift for a SaaS near-realtime analytics application.
Query performance is fine on a 100M-row dataset.
However, the concurrency limit of 15 queries per cluster will become a problem when more users are using the application at the same time.
I cannot cache all aggregated results, since we allow filters to be customized on each query (ad-hoc querying).
The requirements for the application are:
queries must return results within 10s
ad-hoc queries with filters on more than 100 columns
From 1 to 50 clients connected at the same time on the application
dataset growing at 10M rows / day rate
typical queries are SELECTs with the aggregate functions COUNT and AVG, and 1 or 2 joins
Is Redshift not correct for this use case? What other technologies would you consider for those requirements?
This question was also posted on the Redshift Forum: https://forums.aws.amazon.com/thread.jspa?messageID=498430
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP, there is a very nice open source implementation called Mondrian that can run over a variety of databases (including Redshift, AFAIK). Also check out Saiku for an OSS browser-based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect that it will not be noticeable to users, as the queries will simply queue for a second or two.
If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.
Shark/Spark is designed for the case where your data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again, this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and in GROUP BY / ORDER BY. There are no dynamic indexes, so you usually define your structure to suit the tasks.
What you need to ensure is that your joins match the structure 100%. Look at the explain plans: you should not have any redistribution or broadcasting, and no leader-node activities (such as sorting). This sounds like the most critical requirement, considering the number of queries you are going to have.
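As a hedged illustration (table definitions, column names and the filter are hypothetical), co-locating tables on the join key and checking the plan might look like this:

-- both tables distributed on the join key so the join needs no redistribution
CREATE TABLE events
(
    customer_id BIGINT,
    event_ts    TIMESTAMP,
    value       DOUBLE PRECISION
)
DISTKEY (customer_id)
SORTKEY (event_ts);

CREATE TABLE customers
(
    customer_id BIGINT,
    country     VARCHAR(2)
)
DISTKEY (customer_id);

-- in the plan, look for DS_DIST_NONE on the join rather than
-- DS_BCAST_INNER or DS_DIST_BOTH
EXPLAIN
SELECT c.country, COUNT(*), AVG(e.value)
FROM   events e
JOIN   customers c ON c.customer_id = e.customer_id
WHERE  e.event_ts >= DATEADD(month, -3, GETDATE())
GROUP  BY c.country;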
The requirement to be able to filter/aggregate on any of 100 columns can be a problem as well. If the structure (distribution keys, sort keys) doesn't match the columns most of the time, you won't be able to take advantage of Redshift's optimisations. However, these are scalability problems: you can increase the number of nodes to match your performance needs, though you might be surprised by the cost of the optimal solution.
This may not be a serious problem if the number of projected columns is small; otherwise Redshift will have to hold large amounts of data in memory (and eventually spill) while sorting or aggregating (even in a distributed manner), and that can again impact performance.
Beyond scaling, you can always implement sharding or mirroring to overcome some queue/connection limits, or contact AWS support to have some limits lifted.
You should also consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering, and it can store petabytes of data, so it's OK if you store data in excess.
So in summary, I don't think your use case is unsuitable based on just the definition you provided. It might require some work, and the details depend on the exact usage patterns.
I have many read-only tables in a Postgres database. All of these tables can be queried using any combination of columns.
What can I do to optimize queries? Is it a good idea to add indexes to all columns to all tables?
Columns that are used for filtering or joining (or, to a lesser degree, sorting) are of interest for indexing. Columns that are just selected are barely relevant!
For the following query only indexes on a and e may be useful:
SELECT a,b,c,d
FROM tbl_a
WHERE a = $some_value
AND e < $other_value;
Here, f and possibly c are candidates, too:
SELECT a,b,c,d
FROM tbl_a
JOIN tbl_b USING (f)
WHERE a = $some_value
AND e < $other_value
ORDER BY c;
After creating indexes, test to see if they are actually useful with EXPLAIN ANALYZE. Also compare execution times with and without the indexes. Deleting and recreating indexes is fast and easy. There are also planner parameters to experiment with alongside EXPLAIN ANALYZE. The difference may be staggering or nonexistent.
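A hedged sketch of that workflow against the tbl_a example above (the index names and literal values are made up):

-- candidate indexes for the first query
CREATE INDEX tbl_a_a_idx ON tbl_a (a);
CREATE INDEX tbl_a_e_idx ON tbl_a (e);

-- check whether the planner actually uses them, and at what cost
EXPLAIN ANALYZE
SELECT a, b, c, d
FROM tbl_a
WHERE a = 42
AND e < 100;

-- cheap to undo if they turn out not to help
DROP INDEX tbl_a_a_idx;
DROP INDEX tbl_a_e_idx;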
As your tables are read only, index maintenance is cheap. It's merely a question of disk space.
If you really want to know what you are doing, start by reading the docs.
If you don't know what queries to expect ...
Try logging enough queries to find typical use cases. Log all queries with the parameter log_statement = all, or just log slow queries using log_min_duration_statement.
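A hedged sketch of switching that logging on from SQL (the 500 ms threshold is an arbitrary example):

-- log every statement (verbose) ...
ALTER SYSTEM SET log_statement = 'all';
-- ... or only statements slower than 500 ms
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();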
Create indexes that might be useful and check the statistics after some time to see what actually gets used. PostgreSQL has a whole infrastructure in place for monitoring statistics. One convenient way to study the statistics (and handle many other tasks) is pgAdmin, where you can choose your table / function / index and see all the data on the "Statistics" tab in the object browser (main window).
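For example, per-index usage counters can be read straight from the statistics views (a hedged sketch):

-- indexes that have rarely or never been scanned since the last stats reset
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;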
Proceed as described above to see if indexes in use actually speed things up.
If the query planner chooses to use one or more of your indexes but to no or even adverse effect, then something is probably wrong with your setup and you need to study the basics of performance optimization: vacuum, analyze, cost parameters, memory usage, ...
If you filter by more columns, indexes may help, but not much. Also, indexes may not help for small tables.
First, search for "postgresql tuning" - you will find useful information.
If the database can fit in memory - buy enough RAM.
If the database cannot fit in memory - an SSD will help.
If this is not enough and the database is read-only - run 2, 3 or more servers, or partition the database (in the best case so each partition fits in the memory of its server).
Even if the queries are generated, I think they will not be random. Monitor the database for slow queries and improve only those.