Staging Table Design for Performance

I have a typical star schema in my Azure SQL Data Warehouse. Data is first dumped into staging tables via Data Factory, which then calls a master procedure that calls other procedures to transform the data into the appropriate format and then clear out the staging tables for that chunk of data.
Should these staging tables have indexes? Should they have statistics? I recently upgraded to Gen 2, but don't have auto create statistics turned on. I worry that statistics will get created but not updated, and so will end up slowing things down more than anything.
For more context, there is a procedure to rebuild indexes and update statistics which is run overnight, once a day. The data load process is run hourly.

Given that these are staging tables, the biggest impacts will come from the following.
Where possible, use a hash distribution. This will give the best performance when you process the table in subsequent steps. While the documentation sometimes suggests ROUND_ROBIN distribution, and this is slightly faster for ingestion, the next query against the table will be slower.
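For illustration, a minimal sketch of a hash-distributed staging table in Azure SQL DW syntax (table and column names are hypothetical):

    CREATE TABLE stg.Sales
    (
        SaleId     BIGINT        NOT NULL,
        CustomerId INT           NOT NULL,
        SaleAmount DECIMAL(18,2) NULL,
        SaleDate   DATE          NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerId),  -- distribute on the column used by later joins/aggregations
        HEAP                              -- see the HEAP vs clustered columnstore point below
    );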
Always use statistics. I suggest creating them manually, based on expected usage, for greater predictability in your ELT performance. If you don't create and update statistics you're going to get dreadful performance at some point in the future. If you don't want to undertake the effort of manually managing statistics, then definitely turn on auto-create statistics.
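As a sketch of the manual approach, assuming the hypothetical stg.Sales table above and that CustomerId and SaleDate are the columns used in joins and filters:

    CREATE STATISTICS st_stg_Sales_CustomerId ON stg.Sales (CustomerId);
    CREATE STATISTICS st_stg_Sales_SaleDate   ON stg.Sales (SaleDate);

    -- refresh after each hourly load, before the transform procedures run
    UPDATE STATISTICS stg.Sales;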
Consider the use of HEAP vs CLUSTERED COLUMNSTORE table structures for staging tables. In general, staging tables are processed on a whole-row basis, and you may find that your performance is better at the staging layer if you use a HEAP. This needs to be tested on your data, as the Gen2 caching that gives much greater performance does not apply to Heap tables.
Definitely create your fact and dimension tables as clustered columnstore indexes. Hash distribute your fact/s, and replicate your dimensions (unless you have billion row dimensions, in which case a hash distribution may be more appropriate).
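A sketch of what those target tables could look like (names and columns are hypothetical):

    CREATE TABLE dbo.FactSales
    (
        CustomerKey INT           NOT NULL,
        DateKey     INT           NOT NULL,
        SaleAmount  DECIMAL(18,2) NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    );

    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  INT           NOT NULL,
        CustomerId   INT           NOT NULL,  -- natural key from the source system
        CustomerName NVARCHAR(200) NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,
        CLUSTERED COLUMNSTORE INDEX
    );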
If you're using CTAS-based load patterns, your need for non-clustered indexes should be very low. I generally add indexes only when I see a performance problem with a query that I can't solve by any other technique.
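A sketch of the CTAS pattern that loads the hypothetical fact table from the staging heap (the join and date-key derivation are placeholders for your own transform logic):

    CREATE TABLE dbo.FactSales_New
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT d.CustomerKey,
           CONVERT(INT, CONVERT(CHAR(8), s.SaleDate, 112)) AS DateKey,
           s.SaleAmount
    FROM stg.Sales       AS s
    JOIN dbo.DimCustomer AS d
      ON d.CustomerId = s.CustomerId;

    -- swap the new table in, then drop the old one
    RENAME OBJECT dbo.FactSales     TO FactSales_Old;
    RENAME OBJECT dbo.FactSales_New TO FactSales;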
Finally, make sure that you're using a reasonable DWU and Resource Class. A general rule of thumb is that you shouldn't be running your ELT at less than DWU500, and LARGERC. If you don't do this, you'll find that you get bad clustered columnstore indexes which will lead to future performance problems.
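As a sketch, the two relevant settings (warehouse, user and role names are hypothetical; the scale statement runs against the master database):

    -- scale the warehouse to at least DW500c
    ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW500c');

    -- run the ELT login under a larger resource class
    EXEC sp_addrolemember 'largerc', 'elt_load_user';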

Some input from my side:
Your fact table should be partitioned. In fact, you should have a job that creates the fact partitions automatically; a sketch follows below.
How big is the fact table? If it is getting too big then, based on your requirements, you can think about archiving old data out of the fact table when it is no longer needed there.
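A sketch of date partitioning on the hypothetical fact table (boundary values are examples only):

    CREATE TABLE dbo.FactSales
    (
        CustomerKey INT           NOT NULL,
        DateKey     INT           NOT NULL,
        SaleAmount  DECIMAL(18,2) NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX,
        PARTITION (DateKey RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301))
    );

    -- a scheduled job can add the next boundary while it is still empty
    ALTER TABLE dbo.FactSales SPLIT RANGE (20240401);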

Related

Auto tuning column statistics target in postgresql

As the PostgreSQL documentation points out, one way to increase query performance is to increase the statistics target for some columns.
It is known that the default_statistics_target value is not enough for large tables (a few million rows) with irregular value distributions and must be increased.
It seems practical to create a script for auto-tuning the statistics target for each column. I would like to know what the possible obstacles are in writing such a script, and why I can't find such a script online.
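For context, the per-column statement such a script would end up generating looks like this (table and column names are hypothetical; 1000 is an arbitrary target):

    -- raise the statistics target for one column, then re-collect statistics
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
    ANALYZE orders;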
That's because it is not that simple. It does not primarily depend on the size of the table, but on the data in the table and their distribution, the way in which data are modified, and most of all on the queries.
So it is pretty much impossible to make that decision from a look at the persisted state, and even with more information it would require quite a bit of artificial intelligence.
One problem with the PostgreSQL planner statistics is that there is no way to compute statistics over all the rows in a table. PostgreSQL always uses a small sample of the table to compute statistics. This has a big disadvantage on large tables: it can miss important values that would make the difference when estimating the cardinality of some operations in the execution plan, which may cause an inappropriate algorithm to be chosen.
Explanation: http://mssqlserver.fr/postgresql-vs-sql-server-mssql-part-3-very-extremely-detailed-comparison/
Especially § 12 – Planner statistics
The reason that PostgreSQL does not offer a "full scan" statistics mode is that it would take too much time to compute! In fact, PostgreSQL is very slow at many maintenance tasks, such as recomputing statistics, as I show here:
http://mssqlserver.fr/postgresql-vs-microsoft-part-1-dba-queries-performances/
In some other RDBMSs it is possible to do UPDATE STATISTICS ... WITH FULLSCAN (Microsoft SQL Server, for example), and this does not take so much time, because MS SQL Server does it in parallel with multiple threads, which PostgreSQL is unable to do...
Conclusion: PostgreSQL was never designed for huge tables. Consider another RDBMS if you want to deal with big tables and still get good performance...
Just take a look at the COUNT performance of PostgreSQL compared to MS SQL Server:
http://mssqlserver.fr/postgresql-vs-microsoft-sql-server-comparison-part-2-count-performances/

Guideline when designing a database in Postgresql

I am designing a database in Postgresql and I would like to have some expert advices before refactorizing my work.
The database naturally contains different parts that I plan to separate into schemas, in order to have a namespacing of object names that reflects their logical organization. About 20 tables are for scientific purposes, 20 others are technical, and 20 more are about administrative tasks.
Is that a good idea or am I misleading myself into a management overhead that I will regret later?
The database contains 3 tables that are huge. By huge, I mean there are more than 60 million rows in them, and they might grow a little. I think I will create a special tablespace for those tables. I would like to do this in order to logically separate where the data are stored, because the rest of the database should be backed up in a different way than those three tables.
Furthermore, one of those 3 tables contains binary data that is not heavy per row but adds up when multiplied by the number of rows, and this table also grows faster than the other 2. I will therefore periodically purge it after backing up the table.
Is it a good idea to have more than one tablespace in a database? If so, is there any precaution to be taken when proceeding this way?
Thank you in advance for your advice.
Choosing good names and grouping database objects is always a wise choice, and such overhead is not usually considerable.
About separating tablespaces within a single database: it also should not cause any special problem. I have a similar database (but in MySQL) that has a large file table, and I had to move all of its content to another server for optimization reasons; I have had no problem with it so far.
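As a reference point, a minimal sketch of the schema/tablespace split being discussed (schema names, the path and the table are hypothetical):

    -- logical separation by schema
    CREATE SCHEMA science;
    CREATE SCHEMA technical;
    CREATE SCHEMA admin;

    -- physical separation for the huge tables
    CREATE TABLESPACE big_data LOCATION '/mnt/big_disk/pg_tblspc';

    CREATE TABLE science.measurements
    (
        id         bigserial PRIMARY KEY,
        payload    bytea,
        created_at timestamptz NOT NULL DEFAULT now()
    ) TABLESPACE big_data;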
There is a far more important matter in RDBMS design, and that is correct table indexing. I think choosing good indexes is the most critical phase of designing a relational database, and you will see its effect soon (when you begin to write JOIN queries!).
In general, designing and implementing a database is an experimental job that depends on your situation and expertise, so you can't look for a single solid set of instructions.

PostgreSQL tuning best practices for data warehousing

I have found plenty of online and print guides on how to tune and optimize performance for Postgres for OLTP applications, but I haven't found anything of the sort specific to Data Warehousing applications. Since there are so many differences in the types of workload, I'm sure there has to be some differences in how the databases are managed and tuned.
Some of my own:
I have found from the DDL side that I use indexes a lot more liberally, since I usually only worry about inserts once a day and can do batch inserts with index rebuilds.
I will typically use integer surrogate keys for data that has more than one natural key, for faster joins
I will usually define and maintain a very comprehensive date table that has prebuilt date manipulations (fiscal date as opposed to calendar date, fiscal year-month, starting day of the week, etc) and use it liberally as opposed to using functions in select statements and where statements. This usually helps during CPU-bound aggregate queries.
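A minimal sketch of such a date table (the fiscal-year offset is an assumption, here a July start):

    CREATE TABLE dim_date AS
    SELECT d::date                                         AS calendar_date,
           EXTRACT(ISODOW FROM d)::int                     AS iso_day_of_week,
           date_trunc('week', d)::date                     AS week_start_date,
           to_char(d, 'YYYY-MM')                           AS calendar_year_month,
           EXTRACT(YEAR FROM d + interval '6 months')::int AS fiscal_year
    FROM generate_series('2015-01-01'::date, '2030-12-31'::date, interval '1 day') AS g(d);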
I was hoping that I would find some information on memory management and other database settings, but I would be happy to hear any useful best practices specific to Postgres-based Data Warehousing.
My experience (admittedly on a pretty small scale when it comes to data warehouses):
Like you mention, pre-aggregating data is easily the most important thing, as it reduces the amount of data that needs to be read by many orders of magnitude.
Avoid short writing transactions, subtransactions and savepoints. This includes exception handling in PL/pgSQL. These burn through the available "transaction ID" space quickly, and cause expensive "wraparound" vacuums that need to rewrite whole tables.
I found that partitioning tables such that each partition individually can fit in the kernel's cache is good for maintenance and migrations, if you ever need to do any. This means you can recreate all indexes on a partition with just 1 seq scan from disk, instead of one scan for each index.
Like Chris already mentioned, be generous with work_mem and maintenance_work_mem; if your workload doesn't fit in RAM then keeping more temporary data in memory saves I/O and CPU time due to smarter query plans (most importantly HashAggregate).
If you need to do huge sorts, it can help to buy a dedicated SSD for storing the temporary files.
From a memory management perspective, one of your largest differences is that you can often hope to keep the working OLTP set in memory, while this is not the case in OLAP environments. Additionally, your joined sets are very often bigger. This means higher work_mem settings can be very helpful, and to the extent tables are denormalized, work_mem can be pushed a bit higher than it might be otherwise. I am not sure my advice on shared_buffers would change (I prefer to start low and increase, testing performance at each step), but work_mem certainly would need to increase if you are doing reporting on sets of any size.
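A sketch of the kind of settings being discussed (the values are placeholders that depend entirely on your hardware and workload):

    -- per-session for a reporting connection
    SET work_mem = '256MB';

    -- or persist server-wide (shared_buffers needs a restart, the rest a reload)
    ALTER SYSTEM SET shared_buffers = '8GB';
    ALTER SYSTEM SET maintenance_work_mem = '2GB';
    ALTER SYSTEM SET effective_cache_size = '48GB';
    SELECT pg_reload_conf();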

How to get high performance under a large transaction (postgresql)

I have about 2 million rows of data that need to be inserted into PostgreSQL, but the load performs poorly. Can I achieve higher insert performance by splitting the large transaction into smaller ones (actually, I don't want to do this)? Or is there any other, wiser solution?
No, the main way to make it much faster is doing all inserts in one transaction. Multiple transactions, or using no transaction, is much slower.
And try to use COPY, which is even faster: http://www.postgresql.org/docs/9.1/static/sql-copy.html
If you really have to use inserts, you can also try dropping all indexes on this table, and creating them after loading the data.
This can be interesting as well: http://www.postgresql.org/docs/9.1/static/populate.html
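A minimal sketch of the COPY-based load (file path, table and index names are hypothetical; server-side COPY needs superuser rights, psql's \copy works from the client):

    -- drop indexes first so they are not maintained row by row
    DROP INDEX IF EXISTS idx_big_table_created_at;

    -- bulk load in a single transaction
    BEGIN;
    COPY big_table (id, payload, created_at)
        FROM '/tmp/big_table.csv' WITH (FORMAT csv, HEADER true);
    COMMIT;

    -- recreate the index once, after the data is in place
    CREATE INDEX idx_big_table_created_at ON big_table (created_at);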
Possible methods to improve performance:
Use the COPY command.
Try to decrease the isolation level for the transaction if your data can deal with the consequences.
Tweak the PostgreSQL server configuration. The default memory limits are very low and will cause disk thrashing even on a server with gigabytes of free memory.
Turn off disk barriers (e.g. nobarrier flag for the ext4 file system) and/or fsync on the PostgreSQL server. Warning: this is usually unsafe but will improve your performance a lot.
Drop all the indexes in your table before inserting the data. Some indexes require quite a lot of work to keep up to date while rows are added. PostgreSQL may be able to create the indexes faster at the end instead of continuously updating them in parallel with the insertion process. Unfortunately, there's no simple way to "save" the current indexes and later restore/create the same indexes again.
Splitting the insert job into a series of smaller transactions will help only if you have to retry the transaction because of data dependency issues with parallel transactions. If the transaction succeeds on the first try, splitting it into several smaller transactions run in sequence will only decrease your performance.
In my experience you CAN improve INSERT time-to-completion by splitting a large transaction into smaller ones, but only if the table you are inserting to has NO indexes or constraints applied, and NO default field values that would have to contend for a shared resource under multiple concurrent transactions. In that case, splitting the insert into several distinct parts and submitting each concurrently as separate processes will complete the job in significantly less time.

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background:
I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.
I need to extract data from it on a semi real-time basis (someone is bound to ask what semi real-time means, and the answer is: as frequently as I reasonably can, but I will be pragmatic; as a benchmark let's say we are hoping for every 15 min) and feed it into a data warehouse.
How much data? At peak times we are talking approx 80-100k rows per min hitting the OLTP side, off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each but there are various tables etc so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24x5.5.
Best Solution?
From what I can piece together the most practical solution is as follows:
Create a TRIGGER to write all DML activity to a rotating CSV log file
Perform whatever transformations are required
Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
Why this approach?
TRIGGERS allow selective tables to be targeted rather than being system wide, the output is configurable (i.e. into a CSV), and they are relatively easy to write and deploy. SLONY uses a similar approach and the overhead is acceptable
CSV easy and fast to transform
Easy to pump CSV into the DW
Alternatives considered ....
Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). Problem with this is it looked very verbose relative to what I needed and was a little trickier to parse and transform. However it could be faster as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier as it is system wide but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... the problem is the OLTP schema would need to be tweaked to support this, and that has many negative side-effects
Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner
Using the WAL
Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if first extraction.) This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.
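A sketch of the incremental pull described above (table, key column and output path are hypothetical; :last_max_key is a placeholder bound by the extraction job, and psql's \copy avoids the superuser requirement of server-side COPY):

    COPY (
        SELECT *
        FROM orders
        WHERE order_id > :last_max_key   -- last key value from the previous extraction
        ORDER BY order_id
    ) TO '/var/dw/extracts/orders_increment.csv' WITH CSV HEADER;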
If you can think of a 'checksum table' that contains only the IDs and the 'checksum', you can do a quick select not only of the new records, but also of the changed and deleted records.
The checksum could be any CRC32-style checksum function you like.
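A sketch of that idea; PostgreSQL has no built-in crc32, so md5 over the row's text form stands in for it (table and column names are hypothetical):

    -- snapshot of per-row checksums taken at the previous extraction
    CREATE TABLE orders_checksum AS
    SELECT order_id, md5(o::text) AS row_checksum
    FROM orders AS o;

    -- changed rows: same id, different checksum
    SELECT o.order_id
    FROM orders AS o
    JOIN orders_checksum AS c USING (order_id)
    WHERE md5(o::text) <> c.row_checksum;

    -- deleted rows: present in the snapshot but no longer in the table
    SELECT c.order_id
    FROM orders_checksum AS c
    LEFT JOIN orders AS o USING (order_id)
    WHERE o.order_id IS NULL;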
The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" systems (update, insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
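A sketch of that ON CONFLICT pattern (PostgreSQL 9.5+; table and column names are hypothetical, and order_id must have a unique index on the target):

    -- stage the rows pulled since the last run
    CREATE TEMP TABLE stg_orders (LIKE dw.fact_orders INCLUDING DEFAULTS);
    -- ... load stg_orders via COPY or INSERT ...

    INSERT INTO dw.fact_orders AS t (order_id, amount, row_update_timestamp)
    SELECT order_id, amount, row_update_timestamp
    FROM stg_orders
    ON CONFLICT (order_id) DO UPDATE
    SET amount               = EXCLUDED.amount,
        row_update_timestamp = EXCLUDED.row_update_timestamp;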