What could be the reason for "heavy IO" activity in the database when SQL Server Change Tracking is enabled? - sql-server-2008-r2

I am running some tests to quantify the performance/reliability of the Change Tracking feature of SQL Server. I have a single table t1 into which I insert 1 million rows, once with Change Tracking OFF and once with Change Tracking ON. I am monitoring the size of syscommittab, the size of the change tracking side table, and the I/Os recorded against the database.
As expected, the change tracking table and syscommittab only get populated when Change Tracking is ON, and I expected the I/Os recorded against the database to be roughly proportional to the sizes of these tables. To my surprise, they are way off: the I/Os recorded against the database are many orders of magnitude greater than the sizes of these two tables. Does anyone know why, or can anyone give me pointers to figuring it out?
I am using sys.dm_io_virtual_file_stats() to determine the I/O activity on the database, and the following query to determine the sizes of the tracked table and syscommittab.
SELECT sct1.name AS CT_schema,
       sot1.name AS CT_table,
       ps1.row_count AS CT_rows,
       ps1.reserved_page_count*8./1024. AS CT_reserved_MB,
       sct2.name AS tracked_schema,
       sot2.name AS tracked_name,
       ps2.row_count AS tracked_rows,
       ps2.reserved_page_count*8./1024. AS tracked_base_table_MB,
       change_tracking_min_valid_version(sot2.object_id) AS min_valid_version
FROM sys.internal_tables it
JOIN sys.objects sot1 ON it.object_id = sot1.object_id
JOIN sys.schemas AS sct1 ON sot1.schema_id = sct1.schema_id
JOIN sys.dm_db_partition_stats ps1 ON it.object_id = ps1.object_id
     AND ps1.index_id IN (0,1)
LEFT JOIN sys.objects sot2 ON it.parent_object_id = sot2.object_id
LEFT JOIN sys.schemas AS sct2 ON sot2.schema_id = sct2.schema_id
LEFT JOIN sys.dm_db_partition_stats ps2 ON sot2.object_id = ps2.object_id
     AND ps2.index_id IN (0,1)
WHERE it.internal_type IN (209, 210)
  AND (sot2.name = 't1' OR sot1.name = 'syscommittab')
I am checkpointing before running the queries.
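For reference, a minimal sketch of how the file stats can be sampled (the counters are cumulative since the last SQL Server restart, so snapshot them before and after the load and subtract; the MB arithmetic and the join to sys.master_files are illustrative, not necessarily the exact query I run):
-- Snapshot cumulative I/O counters for every file of the current database
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name AS file_name,
       vfs.num_of_writes,
       vfs.num_of_bytes_written/1024./1024. AS written_MB,
       vfs.num_of_reads,
       vfs.num_of_bytes_read/1024./1024. AS read_MB
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs
JOIN sys.master_files mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id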
Any tip or pointer appreciated.
Acknowledgements to https://www.brentozar.com/archive/2014/06/performance-tuning-sql-server-change-tracking/ for the SQL above.

Related

Npgsql - FULL OUTER JOIN on two unrelated tables

I have two tables: point_transactions, which shows how users got and spent their in-app points, and wallet_transactions, which shows how users got and spent their wallet money (real money). These two tables have no direct relation to each other. They both have a created_on column which shows when they were created. I need to create a table that shows the history of a user's transactions (both point and wallet). This table is sorted by the creation time of the transaction and has paging, which means it's better to get a paged result from the database rather than loading all the data into memory.
The following query gives me what I want:
select *,
case
when pt.id is null then wt.created_on
else pt.created_on
end as tx_created_on
from point_transactions as pt
full outer join wallet_transactions as wt on false
order by tx_created_on desc
Is there any way I can get this with EF Core?
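For what it's worth, the same interleaving can be written without the FULL OUTER JOIN as a UNION ALL with NULL padding, which may be easier to map onto two projected EF Core queries combined with Concat. A sketch, assuming both tables have an id column:
select pt.id as point_tx_id,
       null as wallet_tx_id,
       pt.created_on as tx_created_on
from point_transactions as pt
union all
select null,
       wt.id,
       wt.created_on
from wallet_transactions as wt
order by tx_created_on desc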

Joining too many tables makes Postgres query extremely slow

I've been trying to optimize this simple query on Postgres 12 that joins several tables to a base relation. They each have a 1-to-1 relation to the base and have anywhere from 10 thousand to 10 million rows.
SELECT *
FROM base
LEFT JOIN t1 ON t1.id = base.t1_id
LEFT JOIN t2 ON t2.id = base.t2_id
LEFT JOIN t3 ON t3.id = base.t3_id
LEFT JOIN t4 ON t4.id = base.t4_id
LEFT JOIN t5 ON t5.id = base.t5_id
LEFT JOIN t6 ON t6.id = base.t6_id
LEFT JOIN t7 ON t7.id = base.t7_id
LEFT JOIN t8 ON t8.id = base.t8_id
LEFT JOIN t9 ON t9.id = base.t9_id
(the actual relations are a bit more complicated than this, but for demonstration purposes this is fine)
I noticed that the query is still very slow when I only do SELECT base.id, which seems odd, because then the query planner should know that the joins are unnecessary and shouldn't affect performance.
Then I noticed that 8 seems to be some kind of magic number. If I remove any single one of the joins, the query time goes from 500 ms to 1 ms. With EXPLAIN I was able to see that Postgres does index-only scans when joining 8 tables, but with 9 tables it starts doing sequential scans.
That happens even when I only do SELECT base.id, so somehow the number of tables is tripping up the query planner.
We finally found out that there is indeed a configuration setting in Postgres called join_collapse_limit, which is set to 8 by default.
https://www.postgresql.org/docs/current/runtime-config-query.html
The planner will rewrite explicit JOIN constructs (except FULL JOINs) into lists of FROM items whenever a list of no more than this many items would result. Smaller values reduce planning time but might yield inferior query plans. By default, this variable is set the same as from_collapse_limit, which is appropriate for most uses. Setting it to 1 prevents any reordering of explicit JOINs. Thus, the explicit join order specified in the query will be the actual order in which the relations are joined. Because the query planner does not always choose the optimal join order, advanced users can elect to temporarily set this variable to 1, and then specify the join order they desire explicitly.
After reading this we decided to increase the limit, along with other values such as from_collapse_limit and geqo_threshold. Beware that query planning time increases exponentially with the number of joins, so the limit is there for a reason and should not be increased carelessly.
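For anyone reproducing this, the setting can be inspected and raised per session before committing to a server-wide change (a sketch; 12 is an arbitrary test value):
SHOW join_collapse_limit;          -- 8 by default
SET join_collapse_limit = 12;      -- session-local, convenient for testing
SET from_collapse_limit = 12;
-- re-run EXPLAIN on the query above to confirm the plan changes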

How to minimize the use of CTEs in a Postgresql query

The following code works well with a small number of vector features. However, when I run the query against a larger table (~35,000 rows), my memory use goes to 100% (32 GB) and then I get a "Connection to the server has been lost" error in pgAdmin. I am running on localhost, so the issue is not network related. I'm guessing it's because I am using too many CTEs (WITH queries). I was thinking of nesting the query in a PL/pgSQL loop and updating a table with the results, thereby releasing the temporary result sets after each iteration. That seems like an inelegant solution, though, and I was hoping someone might be able to show me how I can minimize the use of CTEs in the query below.
CREATE TABLE dem_stats AS
WITH
-- Select Features using lookup table and determine the raster tiles said features are intersecting
feat AS
(SELECT title_no,
a.grid_tile_name || '.asc' AS tile_name,
a.wkb_geometry as geom
FROM test_polygons a, parcels_all_shapefile_lookup_osgb_grid_5km b
WHERE a.title_no = b.olp_title_no
),
-- Merge rasters tiles from main raster file that intersect features
merged_rast AS
(SELECT ST_Union(rast,1) AS rast
FROM dem, feat
WHERE filename IN (tile_name)
),
-- As the tiles are now merged duplicates are not required
feat_temp AS
(SELECT DISTINCT ON (title_no) * FROM feat
),
-- Clip merged raster and obtain pixel statistics
b_stats AS
(SELECT title_no, (stats).*
FROM (SELECT title_no, ST_SummaryStats(ST_Clip(a.rast,1,b.geom,-9999,true)) AS stats
FROM merged_rast a
INNER JOIN feat_temp b
ON ST_Intersects(b.geom,a.rast)
) AS foo
)
-- Summarise statistics for each title number
SELECT title_no,
count As pixel_val_count,
min AS pixel_val_min,
max AS pixel_val_max,
mean AS pixel_val_mean,
stddev AS pixel_val_stddev
FROM b_stats
WHERE count > 0;
Though it is not exactly readable, you can always inline the CTEs. This is sometimes a good idea because CTEs are optimization fences in PostgreSQL: https://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/
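As a sketch of what that inlining looks like here (not a drop-in replacement for the whole query), the feat CTE can be folded into merged_rast as a plain subquery, so the planner is free to optimize across the boundary:
-- merged_rast with "feat" inlined as a subquery
SELECT ST_Union(rast,1) AS rast
FROM dem
WHERE filename IN
      (SELECT a.grid_tile_name || '.asc'
       FROM test_polygons a, parcels_all_shapefile_lookup_osgb_grid_5km b
       WHERE a.title_no = b.olp_title_no);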

Left Join Hangs

I am trying to figure out what could be causing a left join to hang. I've narrowed the problem down to a specific table, but I can't for the life of me figure out what might be going on. Basically, I have two tables, let's call them table A and table B. When I left join table A to table B (it's a 1-to-1 relationship, with table B not always having a record related to table A), the query hangs. When I inner join table A to table B, it runs in about half a second, returning about 27,000 records. Why is it that when I run a left join, which should take a bit longer but not by much, it hangs? Could I have bad data in table B? The fields I'm joining are bigints. I'm stumped on this one. Any help would be much appreciated.
Here is my sql:
select
RegMemberTrip.idmember,
RegParent1.idMember_Parent1,
regparent1.idParent1
from
regmembertrip
left join
regparent1 on RegMemberTrip.idmember = regparent1.idMember_Parent1
where
regmembertrip.IDRound = 25
RegParent1 is a view
If I change the where criteria to '= 24' it works fine. IDRound = 25 is fairly new data. And like I said, if I keep this the way it is (idround = 25) with an inner join it works fine.
Thanks,
Ben
Have you tried the execution plan tool in Management Studio? Are you sure your left join is not in fact producing a giant Cartesian product across A and B?
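If it helps, the per-table logical reads and timings for the two variants can be compared straight from a query window; these are standard T-SQL session options, nothing specific to this schema:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the LEFT JOIN version and the INNER JOIN version here,
-- then compare the reads and CPU/elapsed times in the Messages tab

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;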

PostgreSQL slow COUNT() - is trigger the only solution?

I have a table with posts, which are categorized by:
type
tag
language
All of those "categories" are stored in separate tables (e.g. posts_types) and connected via join tables (e.g. posts_types_assignment).
COUNTing in PostgreSQL is really slow (I have more than 500k records in that table), and I need to get the number of posts categorized by any combination of type/tag/language.
If I solved it with triggers, the code would be full of multi-level loops, which really doesn't look nice and is hard to maintain.
Is there any other solution for efficiently getting the actual number of posts categorized by any type/tag/language?
Let me get this straight.
You have a table posts. You have a table posts_types. The two have a many to many join on posts_types_assignment. And you have some query like this that is slow:
SELECT count(*)
FROM posts p
JOIN posts_types_assignment pta1
ON p.id = pta1.post_id
JOIN posts_types pt1
ON pt1.id = pta1.post_type_id
AND pt1.type = 'language'
AND pt1.name = 'English'
JOIN posts_types_assignment pta2
ON p.id = pta2.post_id
JOIN posts_types pt2
ON pt2.id = pta2.post_type_id
AND pt2.type = 'tag'
AND pt2.name = 'awesome'
And you would like to know why it is painfully slow.
My first note is that PostgreSQL would have to do a lot less work if you had the identifiers in the posts table rather than in the joins. But that is a moot point; the decision has been made.
My more useful note is that I believe that PostgreSQL has a similar query optimizer to Oracle. In which case to limit the combinatorial explosion of possible query plans that it has to consider, it only considers plans that start with some table, and then repeatedly joins on one more data set at a time. However no such query plan will work here. You can start with pt1, get 1 record, then go to pta1, get a bunch of records, join p, wind up with the same number of records, then join pta2, and now you get a huge number of records, then join to pt2, get just a few records. Joining to pta2 is the slow step, because the database has no idea which records you want, and therefore has to create a temporary result set for every combination of a post and a piece of metadata (type, language or tag) on it.
If this is indeed your problem, then the right plan looks like this. Join pt1 to pta1, put an index on it. Join pt2 to pta2, then join to the result of the first query, then join to p. Then count. This means that we don't get huge result sets.
If this is the case, there is no way to tell the query optimizer that, just this once, you want it to come up with a different kind of execution plan. But there is a way to force it.
CREATE TEMPORARY TABLE t1 AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
  ON pt.id = pta.post_type_id
WHERE pt.type = 'language'
  AND pt.name = 'English';

CREATE INDEX idx1 ON t1 (post_id);

CREATE TEMPORARY TABLE t2 AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
  ON pt.id = pta.post_type_id
JOIN t1
  ON t1.post_id = pta.post_id
WHERE pt.type = 'tag'
  AND pt.name = 'awesome';

SELECT COUNT(*)
FROM posts p
JOIN t2
  ON p.id = t2.post_id;
Barring random typos, etc, this is likely to perform somewhat better. If it doesn't, double check the indexes on your tables.
As btilly notes, and if he has correctly guessed the schema, the table design does not help. It seems (at first sight, at least) that, for example, having three tables posts_tag(post_id, tag), post_lang(post_id, lang) and post_type(post_id, type) would be more natural and much more efficient.
Apart from that (or in addition to that), one could think of a table or materialized view that summarizes all the possible counts, with columns (lang, type, tag, nposts). Of course, computing this in full would be VERY slow, but (apart from the first time) it can be done either in full "in the background" at some interval (if the data does not vary much and you don't require exact counts), or eagerly with triggers.
See for example here
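A minimal sketch of that summary as a materialized view, assuming the three-table design suggested above (post_counts is an invented name; materialized views require PostgreSQL 9.3+):
-- Pre-aggregated counts per (lang, type, tag) combination
CREATE MATERIALIZED VIEW post_counts AS
SELECT pl.lang, pt.type, ptag.tag, COUNT(*) AS nposts
FROM post_lang pl
JOIN post_type pt ON pt.post_id = pl.post_id
JOIN posts_tag ptag ON ptag.post_id = pl.post_id
GROUP BY pl.lang, pt.type, ptag.tag;

-- refresh "in background" at whatever interval the staleness tolerance allows
REFRESH MATERIALIZED VIEW post_counts;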