PostgreSQL: one index is fast and another one is slow

We have a database with 450 million rows structured like this:
uid id_1 id_2 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17
81038392 5655067 5468882 373 117 185 152 199 173 168 138 185 159 154 34 38 50 34 41 57
81038393 5655067 5468883 374 116 184 118 170 143 144 113 164 137 138 37 39 53 37 42 60
81038394 5655067 5468884 371 118 187 118 170 143 144 105 157 131 136 32 35 47 32 39 53
81038395 5655067 5468885 370 116 184 118 170 143 144 105 157 131 136 31 35 46 31 38 53
81038396 5655067 5468886 370 117 185 118 170 143 144 105 157 131 136 29 34 44 29 37 50
81038397 5655067 5470853 368 117 185 110 163 137 140 105 157 131 136 34 36 48 34 39 55
81038398 5655067 5470854 372 119 188 118 170 143 144 113 164 137 138 34 36 49 34 40 55
81038399 5655067 5470855 360 115 182 103 151 131 136 98 145 125 131 30 34 45 30 38 51
81038400 5655067 5470856 357 112 177 103 151 131 136 98 145 125 131 30 34 45 30 37 51
81038401 5655067 5470857 356 111 176 103 151 131 136 98 145 125 131 28 33 43 28 36 50
81038402 5655067 5470858 358 113 179 103 151 131 136 98 145 125 131 31 35 46 31 38 52
81038403 5655067 5472811 344 109 173 152 199 173 168 138 185 159 154 31 36 46 31 39 52
81038404 5655068 5468882 373 117 185 152 199 173 168 138 185 159 154 34 38 50 34 41 57
81038405 5655068 5468883 374 116 184 118 170 143 144 113 164 137 138 37 39 53 37 42 60
81038406 5655068 5468884 371 118 187 118 170 143 144 105 157 131 136 32 35 47 32 39 53
81038407 5655068 5468885 370 116 184 118 170 143 144 105 157 131 136 31 35 46 31 38 53
81038408 5655068 5468886 370 117 185 118 170 143 144 105 157 131 136 29 34 44 29 37 50
81038409 5655068 5470853 368 117 185 110 163 137 140 105 157 131 136 34 36 48 34 39 55
81038410 5655068 5470854 372 119 188 118 170 143 144 113 164 137 138 34 36 49 34 40 55
81038411 5655068 5470855 360 115 182 103 151 131 136 98 145 125 131 30 34 45 30 38 51
81038412 5655068 5470856 357 112 177 103 151 131 136 98 145 125 131 30 34 45 30 37 51
81038413 5655068 5470857 356 111 176 103 151 131 136 98 145 125 131 28 33 43 28 36 50
81038414 5655068 5470858 358 113 179 103 151 131 136 98 145 125 131 31 35 46 31 38 52
We need to constantly do queries like this:
Query 1:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_1 = 5655067;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using id_1_idx on mytable (cost=0.57..99187.68 rows=25742 width=80) (actual time=47.081..2600.899 rows=21487 loops=1)
Index Cond: (id_1 = 5655067)
Buffers: shared hit=9 read=4816
I/O Timings: read=2563.181
Planning time: 0.151 ms
Execution time: 2602.320 ms
(6 rows)
Query 2:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_2 = 5670433;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on mytable (cost=442.02..89887.42 rows=23412 width=80) (actual time=113.200..42127.512 rows=21487 loops=1)
Recheck Cond: (id_2 = 5670433)
Heap Blocks: exact=16988
Buffers: shared hit=30 read=17020
I/O Timings: read=41971.798
-> Bitmap Index Scan on id_2_idx (cost=0.00..436.16 rows=23412 width=0) (actual time=104.928..104.929 rows=21487 loops=1)
Index Cond: (id_2 = 5670433)
Buffers: shared hit=2 read=60
I/O Timings: read=99.235
Planning time: 0.163 ms
Execution time: 42132.556 ms
(11 rows)
There are around 23,000 to 25,000 unique id_1 and id_2 values, and both queries will always return around 24,000 rows of the data. We are only reading data, and the data does not change over time.
The problem:
Query 1 takes around 3 seconds, which is a bit much but still bearable.
Query 2 takes up to 30-40 seconds, which is way too much for us, as the service is an interactive web service.
We have indexed id_1 and id_2. We also added a composite index on id_1 and id_2, as suggested by the Azure PostgreSQL As A Service platform where the data is located. It did not help.
My assumption is that Query 1 is fast because all its rows are located sequentially in the database, whereas with Query 2 the rows are always distributed non-sequentially throughout the whole database.
Restructuring the data to speed up Query 2 is not a good idea, as that would reduce the performance of Query 1. I understand that the way this data is structured is not ideal, but I do not have control over it. Any suggestions for how I could speed up Query 2 to a reasonable level?
Edit 2:
Create index statements:
CREATE INDEX id_1_idx ON mytable (id_1);
CREATE INDEX id_2_idx ON mytable (id_2);
Vacuuming the table did not change the plan. The outputs from EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_1 = 5655067 are very similar after the vacuuming. Here is the output from verbose vacuum:
VACUUM (VERBOSE, ANALYZE) mytable;
INFO: vacuuming "public.mytable"
INFO: index "mytable_pkey" now contains 461691169 row versions in 1265896 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 2695.21 s.
INFO: index "id_1_idx" now contains 461691169 row versions in 1265912 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 1493.20 s.
INFO: index "id_2_idx" now contains 461691169 row versions in 1265912 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 1296.06 s.
INFO: index "mytable_id_1_id_2_idx" now contains 461691169 row versions in 1265912 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 2364.16 s.
INFO: "mytable": found 0 removable, 389040319 nonremovable row versions in 5187205 out of 6155883 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 12767
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 13560.60 s.
INFO: analyzing "public.mytable"
INFO: "mytable": scanned 30000 of 6155883 pages, containing 2250000 live rows and 0 dead rows; 30000 rows in sample, 461691225 estimated total rows
VACUUM

TL;DR
Storage I/O is your major bottleneck, plus not enough RAM for the indexes, as you can calculate yourself:
For the bitmap heap scan you can calculate an average block read latency of ~2.5 milliseconds (17020 blocks read in 41971.798 ms), which is way too slow.
The only way to avoid disk reads is lots of RAM. Faster storage would make the system far more scalable, as most likely this is not the only type of query and not the only table in the database.
Long Version:
The complete EXPLAIN output indicates that the cost evaluation done by the planner is way off and that the performance drop comes from disk reads.
As you wrote that the data does not change over time (and hence, you know the value ranges in advance), you could also range-partition the table on those two columns, so that each query only has to scan a certain partition (using smaller indexes and reading a smaller table heap). But if the application eventually accesses the full range of data more or less evenly, this would not help much either.
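For illustration, a minimal sketch of such a scheme, assuming PostgreSQL 11+ declarative partitioning (the names and range boundaries below are made up, and the d1..d17 columns are elided):

CREATE TABLE mytable_part (
    uid bigint,
    id_1 integer,
    id_2 integer
    -- ...plus d1 .. d17
) PARTITION BY RANGE (id_1);

-- Each id_1 partition is itself sub-partitioned on id_2, so queries
-- filtering on either column can prune partitions:
CREATE TABLE mytable_p1 PARTITION OF mytable_part
    FOR VALUES FROM (5600000) TO (5612500)
    PARTITION BY RANGE (id_2);
CREATE TABLE mytable_p1_s1 PARTITION OF mytable_p1
    FOR VALUES FROM (5600000) TO (5612500);
CREATE TABLE mytable_p1_s2 PARTITION OF mytable_p1
    FOR VALUES FROM (5612500) TO (5625000);
-- ...and so on for the remaining ranges.

-- On PostgreSQL 11+ these cascade to every leaf partition:
CREATE INDEX ON mytable_part (id_1);
CREATE INDEX ON mytable_part (id_2);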
As a result, you should think about replacing the storage subsystem to be able to handle your queries within the performance requirements that your application has.
I suspect that the PostgreSQL server is still running on HDD rather than SSD. A little test with only 120M rows shows the following characteristics for both indexes:
create table nums (
    uid integer primary key, id_1 integer, id_2 integer,
    d1 integer, d2 integer, d3 integer, d4 integer, d5 integer, d6 integer,
    d7 integer, d8 integer, d9 integer, d10 integer, d11 integer, d12 integer,
    d13 integer, d14 integer, d15 integer, d16 integer, d17 integer
);
INSERT INTO nums
SELECT generate_series(80000001, 200000000) AS uid,
       (random() * 23000)::integer + 5600000 AS id_1,
       (random() * 25000)::integer + 5600000 AS id_2,
       (random() * 1000)::integer AS d1,  (random() * 1000)::integer AS d2,
       (random() * 1000)::integer AS d3,  (random() * 1000)::integer AS d4,
       (random() * 1000)::integer AS d5,  (random() * 1000)::integer AS d6,
       (random() * 1000)::integer AS d7,  (random() * 1000)::integer AS d8,
       (random() * 1000)::integer AS d9,  (random() * 1000)::integer AS d10,
       (random() * 1000)::integer AS d11, (random() * 100)::integer AS d12,
       (random() * 100)::integer AS d13,  (random() * 100)::integer AS d14,
       (random() * 100)::integer AS d15,  (random() * 100)::integer AS d16,
       (random() * 100)::integer AS d17;
create index id_1_idx on nums (id_1);
create index id_2_idx on nums (id_2);
cluster nums using id_1_idx;
...resulting in the following (both cold reads):
explain (analyze, buffers) select * from nums where id_1 = 5606001;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Index Scan using id_1_idx on nums (cost=0.57..5816.92 rows=5198 width=80) (actual time=1.680..6.394 rows=5185 loops=1)
Index Cond: (id_1 = 5606001)
Buffers: shared read=88
I/O Timings: read=4.397
Planning Time: 4.002 ms
Execution Time: 7.475 ms
(6 rows)
Time: 15.924 ms
...and for id_2:
explain (analyze, buffers) select * from nums where id_2 = 5606001;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Scan using id_2_idx on nums (cost=0.57..5346.53 rows=4777 width=80) (actual time=0.376..985.689 rows=4748 loops=1)
Index Cond: (id_2 = 5606001)
Buffers: shared hit=1 read=4755
I/O Timings: read=972.555
Planning Time: 0.203 ms
Execution Time: 986.590 ms
(6 rows)
Time: 987.296 ms
So although my table is "just" 12 GiB + 3x 2.5 GiB (PK + 2 indexes), it is still fast enough.
In case the server is already running on SSD, please make sure to (physically) separate the storage for WAL/log, table data (tablespace), and indexes (tablespace), to benefit as much as possible from parallelism and to reduce I/O interference caused by other services/applications on the same system.
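A sketch of such a separation on a self-managed server (the paths and tablespace names are made up for illustration; managed platforms such as Azure usually do not allow custom tablespaces):

-- Tablespaces on separate physical devices:
CREATE TABLESPACE ts_data  LOCATION '/mnt/ssd1/pgdata';
CREATE TABLESPACE ts_index LOCATION '/mnt/ssd2/pgindex';

-- Move the heap and the indexes accordingly (each rewrites and locks the object):
ALTER TABLE mytable  SET TABLESPACE ts_data;
ALTER INDEX id_1_idx SET TABLESPACE ts_index;
ALTER INDEX id_2_idx SET TABLESPACE ts_index;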
Also think about a server system with way more memory for the table and index data (for this, ~48 GiB for the table + ~10 GiB per index, assuming all integer columns), and then do a warm-up to push the data from disk into memory. At least the indexes should be able to stay completely in memory.
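A minimal sketch of an explicit warm-up using the pg_prewarm contrib extension, assuming it is available on your platform:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load the indexes (and, if RAM permits, the heap) into shared buffers:
SELECT pg_prewarm('id_1_idx');
SELECT pg_prewarm('id_2_idx');
SELECT pg_prewarm('mytable');  -- only worthwhile if shared_buffers can hold it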
EDIT:
The reason my server does not use a bitmap (index + heap) scan is that I am running on SSD and have adapted random_page_cost from the default of 4 down to 1.1. For an HDD system, that would make no sense, of course.
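For reference, that adjustment looks like this (only sensible on SSD-class storage):

SET random_page_cost = 1.1;               -- current session only
ALTER SYSTEM SET random_page_cost = 1.1;  -- persisted; then:
SELECT pg_reload_conf();                  -- apply without a restart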
EDIT #2:
A retest of the situation has revealed an interesting behavior:
In my test, I assumed the first column, uid, to be the primary key column and a serial (sequential integer), by which the entries are initially sorted on disk. While generating the data, the values for both of the interesting indexed columns, id_1 and id_2, are generated randomly, which usually ends up being the worst case for big tables.
However, not so in this case. After creating the test data and the indexes, and after analyzing the table but before reordering the data using the index on column id_1, I am now getting these results:
explain (analyze, buffers) select * from nums where id_1 = 5606001;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on nums (cost=63.32..7761.68 rows=5194 width=80) (actual time=1.978..41.007 rows=5210 loops=1)
Recheck Cond: (id_1 = 5606001)
Heap Blocks: exact=5198
Buffers: shared read=5217
I/O Timings: read=28.732
-> Bitmap Index Scan on id_1_idx (cost=0.00..62.02 rows=5194 width=0) (actual time=1.176..1.176 rows=5210 loops=1)
Index Cond: (id_1 = 5606001)
Buffers: shared read=19
I/O Timings: read=0.124
Planning Time: 7.214 ms
Execution Time: 41.419 ms
(11 rows)
...and:
explain (analyze, buffers) select * from nums where id_2 = 5606001;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on nums (cost=58.52..7133.04 rows=4768 width=80) (actual time=7.305..43.830 rows=4813 loops=1)
Recheck Cond: (id_2 = 5606001)
Heap Blocks: exact=4805
Buffers: shared hit=12 read=4810
I/O Timings: read=28.181
-> Bitmap Index Scan on id_2_idx (cost=0.00..57.33 rows=4768 width=0) (actual time=5.102..5.102 rows=4813 loops=1)
Index Cond: (id_2 = 5606001)
Buffers: shared read=17
I/O Timings: read=2.414
Planning Time: 0.227 ms
Execution Time: 44.197 ms
(11 rows)
All plans + optimizations available here:
using id_1_idx
using id_2_idx
I also followed my own best practices and separated the indexes out to another tablespace on a different physical SSD here.
As we can see, to fetch the ~5000 resulting rows it has to read more or less the same number of blocks in both cases, using the bitmap heap scan both times.
The correlation for the two columns in this case:
attname | correlation | n_distinct
---------+-------------+------------
id_1 | -0.0047043 | 23003
id_2 | 0.00157998 | 25004
Now, retesting the queries after the CLUSTER ... USING id_1_idx and after re-analyzing it, resulting in the following correlation:
attname | correlation | n_distinct
---------+--------------+------------
id_1 | 1 | 22801
id_2 | -0.000898521 | 24997
...revealed the following performance:
explain (analyze, buffers) select * from nums where id_1 = 5606001;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Index Scan using id_1_idx on nums (cost=0.57..179.02 rows=5083 width=80) (actual time=2.604..5.256 rows=5210 loops=1)
Index Cond: (id_1 = 5606001)
Buffers: shared read=90
I/O Timings: read=4.107
Planning Time: 4.039 ms
Execution Time: 5.563 ms
(6 rows)
...which is much better - just as expected - but:
explain (analyze, buffers) select * from nums where id_2 = 5606001;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on nums (cost=58.57..7140.12 rows=4775 width=80) (actual time=5.866..99.707 rows=4813 loops=1)
Recheck Cond: (id_2 = 5606001)
Heap Blocks: exact=4806
Buffers: shared read=4823
I/O Timings: read=31.389
-> Bitmap Index Scan on id_2_idx (cost=0.00..57.38 rows=4775 width=0) (actual time=2.992..2.992 rows=4813 loops=1)
Index Cond: (id_2 = 5606001)
Buffers: shared read=17
I/O Timings: read=0.338
Planning Time: 0.210 ms
Execution Time: 100.155 ms
(11 rows)
...more than twice as slow, despite the fact that almost exactly the same number of blocks had to be read as in the first, random run.
Why does it slow down so much?
The physical reordering of the table data using index id_1_idx also changed the physical order of the rows with respect to column id_2. Now, the purpose of the bitmap heap scan is to get a list of blocks to read in physical (on-disk) order from the bitmap index scan. In the first case (random), there was quite a good chance that multiple rows matching the criteria were located in consecutive blocks on disk, resulting in less random disk access.
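For reference, the planner toggle to disable bitmap scans for a session:

SET enable_bitmapscan = off;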
Interestingly (but this might just be because I am running on SSD), disabling the bitmap scan revealed acceptable numbers:
explain (analyze, buffers) select * from nums where id_2 = 5606001;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using id_2_idx on nums (cost=0.57..7257.12 rows=4775 width=80) (actual time=0.151..35.453 rows=4813 loops=1)
Index Cond: (id_2 = 5606001)
Buffers: shared read=4823
I/O Timings: read=30.051
Planning Time: 1.927 ms
Execution Time: 35.810 ms
(6 rows)
All these numbers are almost completely cold-start executions (as you can see from the absent or very low Buffers: shared hit numbers).
Also interesting is that the I/O timings are pretty similar between the bitmap scan and the index scan for id_2, but the bitmap scan seems to introduce a huge overhead here.

The difference is that id_1 is highly correlated, i.e. the order of that column corresponds to the physical order of the rows, while id_2 is not correlated.
Test with
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'mytable'
AND attname IN ('id_1', 'id_2');
If the correlation is high, the rows for a single value of the column will be in a few adjacent blocks of the table. If the correlation is low, the rows will be all over the table and many more blocks have to be read.
To achieve high correlation, you can rewrite the table using the CLUSTER statement to reorder the rows. If there are no deletes or updates, a table will be physically ordered in insertion order.
You can speed up one query or the other, but not both.
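For example, to favor the id_2 query instead (note that CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites the whole table, which will take a while at 450 million rows):

CLUSTER mytable USING id_2_idx;
ANALYZE mytable;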

Related

Trying to partition to remove rows where two columns don't match sql

How can I filter out rows within a group that do not have matching values in two columns?
I have a table A like:
CODE  US_ID  US_PRICE  NON_US_ID  NON_US_PRICE
5109  57     10        75         10
0206  85     11        58         11
0206  85     15        33         14
0206  85     41        22         70
T100  20     10        49         NULL
T100  20     38        64         38
Within each CODE group, I want to check whether US_PRICE = NON_US_PRICE and remove that row from the resulting table.
I tried:
SELECT *,
CASE WHEN US_PRICE != NON_US_PRICE OVER (PARTITION BY CODE) END
FROM A;
but I think I am missing something when I try to partition by CODE.
I want the resulting table to look like
CODE  US_ID  US_PRICE  NON_US_ID  NON_US_PRICE
0206  85     15        33         14
0206  85     41        22         70
T100  20     10        49         NULL
For the provided sample, a simple WHERE clause produces that result:
SELECT *
FROM A
WHERE US_PRICE IS DISTINCT FROM NON_US_PRICE;
IS DISTINCT FROM handles NULLs, unlike the != operator.
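A quick illustration of the difference, using the NULL price from the T100 row (hypothetical standalone queries):

SELECT NULL != 10;                -- NULL, so a plain WHERE would drop the row
SELECT NULL IS DISTINCT FROM 10;  -- true, so the row is kept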

How to delete duplicate rows without unique ID

Id          SleepDay   TotalMinutesAsleep  TotalTimeInBed
8378563200  4/20/2016  381                 409
8378563200  4/21/2016  396                 417
8378563200  4/22/2016  441                 469
8378563200  4/23/2016  565                 591
8378563200  4/24/2016  458                 492
8378563200  4/25/2016  388                 402  ---> this is the duplicate
8378563200  4/25/2016  388                 402
8378563200  4/26/2016  550                 584
8378563200  4/27/2016  531                 600
This is part of my table. How can I delete the duplicate row? I used a CTE, but it deleted all records of id #8378563200 on 4/25/2016.
Use:
DELETE
FROM table1
WHERE ctid IN (SELECT ctid
               FROM (SELECT ctid,
                            ROW_NUMBER() OVER (
                                PARTITION BY Id, SleepDay, TotalMinutesAsleep, TotalTimeInBed) AS rn
                     FROM table1) t
               WHERE rn > 1);
Replace table1 with your own table name.
Without column(s) to identify a unique row?
Then you could use ctid.
ctid
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.
For example:
delete
from SleepLogs log1
using SleepLogs log2
where log2.Id = log1.Id
  and log2.SleepDay = log1.SleepDay
  and log2.TotalMinutesAsleep = log1.TotalMinutesAsleep
  and log2.TotalTimeInBed = log1.TotalTimeInBed
  and log2.ctid < log1.ctid;
1 rows affected
select * from SleepLogs
id          sleepday    totalminutesasleep  totaltimeinbed
8378563200  2016-04-20  381                 409
8378563200  2016-04-21  396                 417
8378563200  2016-04-22  441                 469
8378563200  2016-04-23  565                 591
8378563200  2016-04-24  458                 492
8378563200  2016-04-25  388                 402
8378563200  2016-04-26  550                 584
8378563200  2016-04-27  531                 600
Test on db<>fiddle here

Problem counting items in an individual row without duplication

I'm trying to write a query that counts the primary and secondary activity only when GroupID = 260 and ItemID in (1302,1303,1305,1306), for each individual RecordID. So far I have been able to single out the rows with those conditions, but I only want to count the primary and secondary activities once per RecordID if they aren't null, regardless of how many rows match those conditions (the primary and secondary activities are the same for every row of a given RecordID).
RecordID GroupID ItemID PrimActivity SecActivity
320 260 1302 36 0
320 260 6456 36 0
320 312 1303 36 0
560 400 1302 46 48
560 312 1305 46 48
460 260 1305 45 56
460 260 1302 45 56
Result I'm getting:
RecordID Count
320 2
460 4
Expected result:
RecordID Count
320 1
460 2
SELECT dfr.RecordID,
       COUNT(CASE WHEN dfr.PrimActivity <> 0 AND a.GroupID = 260 THEN 1
                  ELSE NULL END) +
       COUNT(CASE WHEN dfr.SecActivity <> 0 AND a.GroupID = 260 THEN 1
                  ELSE NULL END) AS Count
FROM ActivityItem ai
JOIN DailyRecord dfr ON ai.PrimActivity = dfr.PrimActivity
JOIN AreaInfo af ON af.AreaInfoID = dfr.AreaInfoID
JOIN Information a ON dfr.RecordID = a.RecordID
JOIN Lookup lp ON lp.ItemID = a.ItemID
WHERE a.GroupID LIKE '260' AND EXISTS (
    SELECT b.RecordID, b.GroupID, b.ItemID
    FROM AreaInfo b
    WHERE a.RecordID = b.RecordID AND b.ItemID IN (1302,1303,1305,1306))
GROUP BY dfr.RecordID
You should be clearer when explaining the structure of the tables you are using. However, I reached the expected result starting from your sample table with this:
SELECT RecordID, COUNT(*) AS Count
FROM (SELECT DISTINCT RecordID, ItemID, PrimActivity, SecActivity
      FROM [TABLE YOU POSTED]
      WHERE GroupID = 260 AND ItemID IN (1302,1303,1305,1306)) A
GROUP BY RecordID

SELECT FROM VALUES used a bit like a CASE statement - but possibly more powerful

I just found myself writing the code below, which works.
Interesting, but is it necessarily the best method?
The syntax allows the TRY_CAST to be performed only once.
Note "Atextfield" can contain valid numbers and invalid numbers.
SELECT *
FROM call
WHERE EXISTS (SELECT 1
              FROM (VALUES (TRY_CAST(call.[Atextfield] AS int))
                   ) AS Table1(num)
              WHERE (Table1.num BETWEEN 124 AND 140)
                 OR (Table1.num BETWEEN 143 AND 146)
                 OR (Table1.num BETWEEN 148 AND 149)
                 OR (Table1.num BETWEEN 160 AND 169)
                 OR (Table1.num BETWEEN 181 AND 189)
             );
2. Could this be re-written as follows?
SELECT *
FROM [call]
WHERE TRY_CAST([call].AtextField AS TINYINT) BETWEEN 124 AND 189
AND TRY_CAST([call].AtextField AS TINYINT) NOT IN (141,142,147)
AND TRY_CAST([call].AtextField AS TINYINT) NOT BETWEEN 150 AND 159
AND TRY_CAST([call].AtextField AS TINYINT) NOT BETWEEN 170 AND 180
Note I'm new to CASE in t-sql...
2A. Is the TRY_CAST(...) evaluated more than once?
Which of the above will be quicker?
Is there a better way to write this?
Is the first method useful when the criteria get more involved and complex?
Is this an acceptable approach?
Harvey
There's no need to use exists or 1 = CASE...
Just put your logic in the where clause directly. I'd probably do something like this:
SELECT *
FROM [call]
WHERE TRY_CAST([call].AtextField AS TINYINT) BETWEEN 124 AND 189
AND TRY_CAST([call].AtextField AS TINYINT) NOT IN (141,142,147)
AND TRY_CAST([call].AtextField AS TINYINT) NOT BETWEEN 150 AND 159
AND TRY_CAST([call].AtextField AS TINYINT) NOT BETWEEN 170 AND 180
Cross Apply Method:
SELECT *
FROM [call]
CROSS APPLY (SELECT TRY_CAST([call].AtextField AS TINYINT)) CA(intField)
WHERE intField BETWEEN 124 AND 189
  AND intField NOT IN (141,142,147)
  AND intField NOT BETWEEN 150 AND 159
  AND intField NOT BETWEEN 170 AND 180
My guess is that your query and mine will perform pretty similarly. If you want to check performance, run this first, then run each query and record the logical reads and times:
SET STATISTICS IO ON
SET STATISTICS TIME ON

Recursive CTE with multiple valid same parent child relationships

I have an equipment inventory application I am working on. A piece of equipment is my top level, and it contains assemblies, sub-assemblies and parts. I am trying to use a recursive CTE to display the parent/child relationships. The issue I am having is that some assemblies can have multiple sub-assemblies that are the same, meaning there is no difference in the part numbers. This causes my query to not show the correct relationship based on my ORDER BY statement. This is the first time I have used a CTE, so I have been using a lot of what I learned on the web.
PartNumberID 174 is used twice in this assembly.
Sample Table
equipmentID parentPartNumberID partNumberID
17 1 281
17 281 156
17 156 161
17 161 224
17 281 174
17 174 192
17 192 56
17 174 193
17 281 174
17 174 192
17 192 56
17 174 193
17 281 283
17 283 183
17 283 277
17 283 173
Results of Query
PARENT CHILD PARTLEVEL HIERARCHY
1 281 0 281
281 156 1 281.156
156 161 2 281.156.161
161 224 3 281.156.161.224
281 174 1 281.174
281 174 1 281.174
174 192 2 281.174.192
174 192 2 281.174.192
192 56 3 281.174.192.56
192 56 3 281.174.192.56
174 193 2 281.174.193
174 193 2 281.174.193
281 283 1 281.283
283 173 2 281.283.173
283 183 2 281.283.183
283 277 2 281.283.277
As you can see, the hierarchy is created correctly, but it is not being returned correctly because there is nothing unique about these 2 assemblies for the ORDER BY statement.
The Code:
with parts (PARENT, CHILD, PARTLEVEL, HIERARCHY) as (
    select parentPartNumberID,
           --- Used to get rid of duplicates
           case when ROW_NUMBER() OVER (PARTITION BY partNumberID ORDER BY partNumberID) > 1
                then NULL
                else partNumberID end as partNumberID,
           0,
           cast(partNumberID as nvarchar) as HIERARCHY
    from dbo.tbl_BOM_Elements
    where parentPartNumberID = 1 and equipmentID = 17
    union all
    select parts1.parentPartNumberID,
           --- Used to get rid of duplicates
           case when ROW_NUMBER() OVER (PARTITION BY parts1.partNumberID ORDER BY parts1.partNumberID) > 1
                then 10000 + parts1.partNumberID
                else parts1.partNumberID end,
           PARTLEVEL + 1,
           cast(parts.HIERARCHY + '.' + cast(parts1.partNumberID as nvarchar) as nvarchar)
    from dbo.tbl_BOM_Elements as parts1
    inner join parts on parts1.parentPartNumberID = parts.CHILD
    where parts1.equipmentID = 17
)
select case when PARENT > 10000 then PARENT - 10000 else PARENT end as PARENT,
       case when CHILD > 10000 then CHILD - 10000 else CHILD end as CHILD,
       PARTLEVEL, HIERARCHY
from parts
order by HIERARCHY
I tried to create a unique ID to order by but was not successful. Any suggestions would be greatly appreciated.
I'll start by just answering the part about getting a sequential id.
If you have control, you could just add a unique Id to your source table. Having a surrogate primary key would be pretty typical here.
You could instead use a second CTE before the recursive one and add the row numbers there using ROW_NUMBER() OVER (ORDER BY equipmentID, parentPartNumberID, partNumberID). Then build your recursive CTE off of that rather than off the source table directly.
Better might be to use the first CTE to instead GROUP BY equipmentID, parentPartNumberID, partNumberID and add a COUNT(*) field. This would let you use the count in your hierarchy rather than getting duplicates. Something like 281.283.277x2 or whatever, as in the sketch below.
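A minimal, untested sketch of that grouped approach, reusing the table and column names from the question (the nvarchar sizes are arbitrary):

with grouped as (
    -- collapse identical parent/child pairs and remember how many there were
    select parentPartNumberID, partNumberID, count(*) as qty
    from dbo.tbl_BOM_Elements
    where equipmentID = 17
    group by parentPartNumberID, partNumberID
),
parts (PARENT, CHILD, PARTLEVEL, HIERARCHY) as (
    select parentPartNumberID, partNumberID, 0,
           cast(cast(partNumberID as nvarchar(20))
                + case when qty > 1 then N'x' + cast(qty as nvarchar(10)) else N'' end
                as nvarchar(400))
    from grouped
    where parentPartNumberID = 1
    union all
    select g.parentPartNumberID, g.partNumberID, p.PARTLEVEL + 1,
           cast(p.HIERARCHY + N'.' + cast(g.partNumberID as nvarchar(20))
                + case when g.qty > 1 then N'x' + cast(g.qty as nvarchar(10)) else N'' end
                as nvarchar(400))
    from grouped g
    join parts p on g.parentPartNumberID = p.CHILD
)
select PARENT, CHILD, PARTLEVEL, HIERARCHY
from parts
order by HIERARCHY;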