I have a table like this:
id | name   | planned_amount | actual | avail
1  | ABC    | 100            | 123    | 100
2  | DEF    | 200            | 345    | 200
3  | uytuyn | 9000           | 311    | 300
4  | oiui   | 890            | 200    | 200
The data above is the output of a query like this:
select a.name,
e.planned_amount,
dd.sum as actual,
e.planned_amount - dd.sum as avail
from crossovered_budget_lines as e,
account_analytic_account as a,
(select analytic_account_id, sum(debit+credit)
from account_move_line
where date between '2021-01-01' and '2021-05-01'
group by analytic_account_id) as dd
where e.analytic_account_id = a.id
and dd.analytic_account_id = a.id
limit 5
When the date range matches no rows in account_move_line (no output data), I want output like this, with the actual column set to 0:
id | name   | planned_amount | actual | avail
1  | ABC    | 100            | 0      | 100
2  | DEF    | 200            | 0      | 200
3  | uytuyn | 9000           | 0      | 300
4  | oiui   | 890            | 0      | 200
How can I write the query to produce the output above? I am stuck. Thanks.
We have a database with 450 million rows structured like this:
uid id_1 id_2 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17
81038392 5655067 5468882 373 117 185 152 199 173 168 138 185 159 154 34 38 50 34 41 57
81038393 5655067 5468883 374 116 184 118 170 143 144 113 164 137 138 37 39 53 37 42 60
81038394 5655067 5468884 371 118 187 118 170 143 144 105 157 131 136 32 35 47 32 39 53
81038395 5655067 5468885 370 116 184 118 170 143 144 105 157 131 136 31 35 46 31 38 53
81038396 5655067 5468886 370 117 185 118 170 143 144 105 157 131 136 29 34 44 29 37 50
81038397 5655067 5470853 368 117 185 110 163 137 140 105 157 131 136 34 36 48 34 39 55
81038398 5655067 5470854 372 119 188 118 170 143 144 113 164 137 138 34 36 49 34 40 55
81038399 5655067 5470855 360 115 182 103 151 131 136 98 145 125 131 30 34 45 30 38 51
81038400 5655067 5470856 357 112 177 103 151 131 136 98 145 125 131 30 34 45 30 37 51
81038401 5655067 5470857 356 111 176 103 151 131 136 98 145 125 131 28 33 43 28 36 50
81038402 5655067 5470858 358 113 179 103 151 131 136 98 145 125 131 31 35 46 31 38 52
81038403 5655067 5472811 344 109 173 152 199 173 168 138 185 159 154 31 36 46 31 39 52
81038404 5655068 5468882 373 117 185 152 199 173 168 138 185 159 154 34 38 50 34 41 57
81038405 5655068 5468883 374 116 184 118 170 143 144 113 164 137 138 37 39 53 37 42 60
81038406 5655068 5468884 371 118 187 118 170 143 144 105 157 131 136 32 35 47 32 39 53
81038407 5655068 5468885 370 116 184 118 170 143 144 105 157 131 136 31 35 46 31 38 53
81038408 5655068 5468886 370 117 185 118 170 143 144 105 157 131 136 29 34 44 29 37 50
81038409 5655068 5470853 368 117 185 110 163 137 140 105 157 131 136 34 36 48 34 39 55
81038410 5655068 5470854 372 119 188 118 170 143 144 113 164 137 138 34 36 49 34 40 55
81038411 5655068 5470855 360 115 182 103 151 131 136 98 145 125 131 30 34 45 30 38 51
81038412 5655068 5470856 357 112 177 103 151 131 136 98 145 125 131 30 34 45 30 37 51
81038413 5655068 5470857 356 111 176 103 151 131 136 98 145 125 131 28 33 43 28 36 50
81038414 5655068 5470858 358 113 179 103 151 131 136 98 145 125 131 31 35 46 31 38 52
We need to constantly do queries like this:
Query 1:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_1 = 5655067;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using id_1_idx on mytable (cost=0.57..99187.68 rows=25742 width=80) (actual time=47.081..2600.899 rows=21487 loops=1)
Index Cond: (id_1 = 5655067)
Buffers: shared hit=9 read=4816
I/O Timings: read=2563.181
Planning time: 0.151 ms
Execution time: 2602.320 ms
(6 rows)
Query 2:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_2 = 5670433;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on mytable (cost=442.02..89887.42 rows=23412 width=80) (actual time=113.200..42127.512 rows=21487 loops=1)
Recheck Cond: (id_2 = 5670433)
Heap Blocks: exact=16988
Buffers: shared hit=30 read=17020
I/O Timings: read=41971.798
-> Bitmap Index Scan on id_2_idx (cost=0.00..436.16 rows=23412 width=0) (actual time=104.928..104.929 rows=21487 loops=1)
Index Cond: (id_2 = 5670433)
Buffers: shared hit=2 read=60
I/O Timings: read=99.235
Planning time: 0.163 ms
Execution time: 42132.556 ms
(11 rows)
There are around 23 000 to 25 000 unique id_1 and id_2 values, and both queries will always return around 24 000 rows of the data. We are only reading data, and the data does not change over time.
The problem:
Query 1 takes around 3 seconds, which is a bit much but still bearable.
Query 2 takes up to 30-40 seconds, which is way too much for us, as the service is an interactive web service.
We have indexed id_1 and id_2. We also added a joint index on id_1 and id_2, as this was suggested by the Azure PostgreSQL As A Service platform where the data is located. It did not help.
My assumption is that Query 1 is fast because all the rows are located sequentially in the database, whereas for Query 2 the rows are always distributed non-sequentially throughout the whole database.
Restructuring the data to speed up Query 2 is not a good idea, as that would reduce the performance of Query 1. I understand that the way this data is structured is not ideal, but I do not have control over it. Any suggestions for how I could speed up Query 2 to a reasonable level?
Edit 2:
Create index statements:
CREATE INDEX id_1_idx ON mytable (id_1);
CREATE INDEX id_2_idx ON mytable (id_2);
Vacuuming the table did not change the plan. The outputs from EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_1 = 5655067 are very similar after the vacuuming. Here is the output from verbose vacuum:
VACUUM (VERBOSE, ANALYZE) mytable;
INFO: vacuuming "public.mytable"
INFO: index "mytable_pkey" now contains 461691169 row versions in 1265896 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 2695.21 s.
INFO: index "id_1_idx" now contains 461691169 row versions in 1265912 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 1493.20 s.
INFO: index "id_2_idx" now contains 461691169 row versions in 1265912 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 1296.06 s.
INFO: index "mytable_id_1_id_2_idx" now contains 461691169 row versions in 1265912 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 2364.16 s.
INFO: "mytable": found 0 removable, 389040319 nonremovable row versions in 5187205 out of 6155883 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 12767
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 13560.60 s.
INFO: analyzing "public.mytable"
INFO: "mytable": scanned 30000 of 6155883 pages, containing 2250000 live rows and 0 dead rows; 30000 rows in sample, 461691225 estimated total rows
VACUUM
TL;DR
Storage I/O is your major bottleneck, and there is not enough RAM for the indexes, as you can calculate yourself:
For the bitmap heap scan you can calculate an average block read latency of ~2.5 milliseconds (17020 blocks read in 41971.798 ms), which is way too slow.
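The per-block figure can be checked directly from the numbers in the two plans in the question. A minimal sketch in Python (the variable names are mine, the figures are copied from the EXPLAIN output):

```python
# Figures copied from the Bitmap Heap Scan node of Query 2.
blocks_read = 17020       # Buffers: shared hit=30 read=17020
io_read_ms = 41971.798    # I/O Timings: read=41971.798

avg_latency_ms = io_read_ms / blocks_read
print(f"Query 2: {avg_latency_ms:.2f} ms per block read")

# Same calculation for Query 1 (Index Scan on id_1_idx), for comparison.
q1_latency_ms = 2563.181 / 4816   # I/O Timings: read / Buffers: read
print(f"Query 1: {q1_latency_ms:.2f} ms per block read")
```

Query 2 not only reads about 3.5 times as many blocks as Query 1, each read also takes several times longer on average.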
The only way to avoid disk reads is lots of RAM. Faster storage would make the system far more scalable, as this is most likely not the only type of query and not the only table in the database.
Long Version:
The EXPLAIN output you provided indicates that the cost evaluation done by the planner is way off and that the performance drop comes from disk reads.
Since you wrote that the data does not change over time (and hence you know the value ranges in advance), you could also range-partition your table on those two columns, so that a query only has to scan a certain partition (using smaller indexes and reading a smaller table heap). But if the application eventually accesses more or less the full range of data, this would not help much either.
As a result, you should think about replacing the storage subsystem to be able to handle your queries within the performance requirements that your application has.
I suspect that the PostgreSQL server is still running on HDD rather than SSD. A little test with only 120M rows shows the following characteristics for both indexes:
create table nums (uid integer primary key, id_1 integer, id_2 integer, d1 integer, d2 integer, d3 integer, d4 integer, d5 integer, d6 integer, d7 integer, d8 integer, d9 integer, d10 integer, d11 integer, d12 integer, d13 integer, d14 integer, d15 integer, d16 integer, d17 integer);
INSERT INTO nums select generate_series(80000001, 200000000) AS uid, (random() * 23000)::integer + 5600000 AS id_1, (random() * 25000)::integer + 5600000 AS id_2, (random() * 1000)::integer AS d1, (random() * 1000)::integer AS d2, (random() * 1000)::integer AS d3, (random() * 1000)::integer AS d4, (random() * 1000)::integer AS d5, (random() * 1000)::integer AS d6, (random() * 1000)::integer AS d7, (random() * 1000)::integer AS d8, (random() * 1000)::integer AS d9, (random() * 1000)::integer AS d10, (random() * 1000)::integer AS d11, (random() * 100)::integer AS d12, (random() * 100)::integer AS d13, (random() * 100)::integer AS d14, (random() * 100)::integer AS d15, (random() * 100)::integer AS d16, (random() * 100)::integer AS d17;
create index id_1_idx on nums (id_1);
create index id_2_idx on nums (id_2);
cluster nums using id_1_idx;
...resulting in the following (both cold reads):
explain (analyze, buffers) select * from nums where id_1 = 5606001;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Index Scan using id_1_idx on nums (cost=0.57..5816.92 rows=5198 width=80) (actual time=1.680..6.394 rows=5185 loops=1)
Index Cond: (id_1 = 5606001)
Buffers: shared read=88
I/O Timings: read=4.397
Planning Time: 4.002 ms
Execution Time: 7.475 ms
(6 rows)
Time: 15.924 ms
...and for id_2:
explain (analyze, buffers) select * from nums where id_2 = 5606001;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Scan using id_2_idx on nums (cost=0.57..5346.53 rows=4777 width=80) (actual time=0.376..985.689 rows=4748 loops=1)
Index Cond: (id_2 = 5606001)
Buffers: shared hit=1 read=4755
I/O Timings: read=972.555
Planning Time: 0.203 ms
Execution Time: 986.590 ms
(6 rows)
Time: 987.296 ms
So although my table is "just" 12 GiB + 3x 2.5 GiB (PK + 2 indexes), it is still fast enough.
In case the server already is running on SSD, please make sure to (physically) separate data storage for WAL/log, table data (tablespace), indexes (tablespace) to benefit as much as possible from parallelism and to reduce I/O interference caused by other services/applications on the same system.
Also think about a server system with way more memory for the table and index data (for this ~ 48 GiB table + ~10 GiB per index, assuming all integer columns) and then do a warm-up to push data from disk into memory. At least indexes should be able to completely stay in memory.
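The sizes quoted above are estimates. As a rough cross-check, the page counts in the VACUUM (VERBOSE, ANALYZE) output from the question can be converted to GiB. This sketch assumes PostgreSQL's default block size of 8 KiB, which the question does not state explicitly:

```python
BLOCK_SIZE = 8192  # bytes; PostgreSQL's default block size (assumed unchanged)

# Page counts copied from the VACUUM (VERBOSE, ANALYZE) output in the question.
pages = {
    "mytable (heap)": 6155883,
    "mytable_pkey": 1265896,
    "id_1_idx": 1265912,
    "id_2_idx": 1265912,
    "mytable_id_1_id_2_idx": 1265912,
}

def gib(n_pages: int) -> float:
    """Convert a number of blocks to GiB."""
    return n_pages * BLOCK_SIZE / 2**30

for name, n_pages in pages.items():
    print(f"{name:24s} {gib(n_pages):6.2f} GiB")

total_gib = sum(gib(n_pages) for n_pages in pages.values())
print(f"{'total':24s} {total_gib:6.2f} GiB")
```

Under that assumption the heap comes out at roughly 47 GiB and each index at a little under 10 GiB, which is in line with the estimate above.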
EDIT:
The reason my server does not use a bitmap (index + heap) scan is that I am running on SSD and have lowered random_page_cost from the default of 4 to 1.1. For an HDD system, that would make no sense, of course.
EDIT #2:
A retest of the situation has revealed an interesting behavior:
In my test, I assumed the first column uid to be the primary key column and a serial (sequential integer), by which the entries are initially sorted on disk. While generating the data, the values for the two indexed columns of interest, id_1 and id_2, are generated randomly, which usually ends up being the worst case for big tables.
However, not so in this case. After creating the test data and the indexes and after analyzing the table, but before reordering the data using the index on column id_1, I now get these results:
explain (analyze, buffers) select * from nums where id_1 = 5606001;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on nums (cost=63.32..7761.68 rows=5194 width=80) (actual time=1.978..41.007 rows=5210 loops=1)
Recheck Cond: (id_1 = 5606001)
Heap Blocks: exact=5198
Buffers: shared read=5217
I/O Timings: read=28.732
-> Bitmap Index Scan on id_1_idx (cost=0.00..62.02 rows=5194 width=0) (actual time=1.176..1.176 rows=5210 loops=1)
Index Cond: (id_1 = 5606001)
Buffers: shared read=19
I/O Timings: read=0.124
Planning Time: 7.214 ms
Execution Time: 41.419 ms
(11 rows)
...and:
explain (analyze, buffers) select * from nums where id_2 = 5606001;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on nums (cost=58.52..7133.04 rows=4768 width=80) (actual time=7.305..43.830 rows=4813 loops=1)
Recheck Cond: (id_2 = 5606001)
Heap Blocks: exact=4805
Buffers: shared hit=12 read=4810
I/O Timings: read=28.181
-> Bitmap Index Scan on id_2_idx (cost=0.00..57.33 rows=4768 width=0) (actual time=5.102..5.102 rows=4813 loops=1)
Index Cond: (id_2 = 5606001)
Buffers: shared read=17
I/O Timings: read=2.414
Planning Time: 0.227 ms
Execution Time: 44.197 ms
(11 rows)
All plans + optimizations available here:
using id_1_idx
using id_2_idx
I also followed my own best practices and separated out the indexes to another tablespace on different physical SSD here.
As we can see, to fetch the ~5000 resulting rows it has to read more or less the same number of blocks here, in both cases using the bitmap heap scan.
The correlation for the two columns in this case:
attname | correlation | n_distinct
---------+-------------+------------
id_1 | -0.0047043 | 23003
id_2 | 0.00157998 | 25004
Now, retesting the queries after the CLUSTER ... USING id_1_idx and after re-analyzing the table, which results in the following correlation:
attname | correlation | n_distinct
---------+--------------+------------
id_1 | 1 | 22801
id_2 | -0.000898521 | 24997
...revealed the following performance:
explain (analyze, buffers) select * from nums where id_1 = 5606001;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Index Scan using id_1_idx on nums (cost=0.57..179.02 rows=5083 width=80) (actual time=2.604..5.256 rows=5210 loops=1)
Index Cond: (id_1 = 5606001)
Buffers: shared read=90
I/O Timings: read=4.107
Planning Time: 4.039 ms
Execution Time: 5.563 ms
(6 rows)
...which is much better - just as expected - but:
explain (analyze, buffers) select * from nums where id_2 = 5606001;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on nums (cost=58.57..7140.12 rows=4775 width=80) (actual time=5.866..99.707 rows=4813 loops=1)
Recheck Cond: (id_2 = 5606001)
Heap Blocks: exact=4806
Buffers: shared read=4823
I/O Timings: read=31.389
-> Bitmap Index Scan on id_2_idx (cost=0.00..57.38 rows=4775 width=0) (actual time=2.992..2.992 rows=4813 loops=1)
Index Cond: (id_2 = 5606001)
Buffers: shared read=17
I/O Timings: read=0.338
Planning Time: 0.210 ms
Execution Time: 100.155 ms
(11 rows)
...more than twice as slow, despite the fact that almost the exact same number of blocks had to be read as in the first random run.
Why does it slow down so much?
The physical reordering of the table data using index id_1_idx also affected the physical order with respect to the other column, id_2. Now, the purpose of the bitmap heap scan is to get a list of blocks to read in physical (on-disk) order from the bitmap index scan. In the first case (random), there was quite a good chance that multiple rows matching the criteria were located in consecutive blocks on disk, resulting in less random disk access.
Interestingly (but this might just be because I am running on SSD), disabling the bitmap scan revealed acceptable numbers:
explain (analyze, buffers) select * from nums where id_2 = 5606001;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using id_2_idx on nums (cost=0.57..7257.12 rows=4775 width=80) (actual time=0.151..35.453 rows=4813 loops=1)
Index Cond: (id_2 = 5606001)
Buffers: shared read=4823
I/O Timings: read=30.051
Planning Time: 1.927 ms
Execution Time: 35.810 ms
(6 rows)
All these numbers are from almost completely cold-start executions (as you can see from the absent or very low Buffers: shared hit numbers).
Interesting also is that the I/O timings are pretty similar between the bitmap scan and index scan for id_2, but the bitmap scan seems to introduce a huge overhead here.
The difference is that id_1 is highly correlated, i.e. the order of that column corresponds to the physical order of the rows, while id_2 is not correlated.
Test with
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'mytable'
AND attname IN ('id_1', 'id_2');
If the correlation is high, the rows for a single value of the column will be in a few adjacent blocks of the table. If the correlation is low, the rows will be all over the table and many more blocks have to be read.
To achieve high correlation, you can rewrite a table using the CLUSTER statement to reorder the rows. If there are no deletes and updates, a table will be physically ordered in insertion order.
You can speed up one query or the other, but not both.
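To make the effect of correlation concrete, here is a small simulation in Python. It is only a sketch: the table size and the number of distinct values are scaled-down, hypothetical figures, and the 75 rows per block is derived from the ANALYZE sample in the question (2250000 rows in 30000 pages). The first column follows the physical row order, the second is assigned at random.

```python
import random

ROWS_PER_BLOCK = 75   # from the ANALYZE sample: 2250000 rows / 30000 pages
N_ROWS = 300_000      # scaled-down, hypothetical table size
N_VALUES = 100        # distinct values per column (hypothetical)

random.seed(0)
# Physical order follows id_1 (correlation close to 1); id_2 is random.
table = [(i * N_VALUES // N_ROWS, random.randrange(N_VALUES)) for i in range(N_ROWS)]

def blocks_touched(column: int, value: int) -> int:
    """Count the distinct heap blocks that hold rows where column == value."""
    return len({pos // ROWS_PER_BLOCK
                for pos, row in enumerate(table)
                if row[column] == value})

blocks_id_1 = blocks_touched(0, 42)   # correlated column
blocks_id_2 = blocks_touched(1, 42)   # uncorrelated column
print("blocks read for id_1 = 42:", blocks_id_1)
print("blocks read for id_2 = 42:", blocks_id_2)
```

Both lookups return about the same number of rows, but the rows for the correlated column sit in a few dozen adjacent blocks, while the rows for the uncorrelated column are spread over far more blocks.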
I have a table containing all the trips taken by different cars. I've filtered this table down to trips that had multiple stops specifically. Now all I want to do is get the first stop that each car had.
What I've got is:
Car ID
Date_depart
Date_arrive
Count (from a previous table creation)
I've filtered this table by using Car ID + Date Depart and making a count where there are multiple date_arrives for a single date_depart. Now I'm trying to figure out how to get back only the first stop, but I am completely stuck. Apart from the lateral join X, order by Z limit 1 method, I have no idea how to get back only the first result in this table.
Here's some sample data:
Car ID Date_depart Date_arrive Count
949 2017-01-01 2017-01-05 2
949 2017-01-01 2017-01-09 2
1940 2017-01-09 2017-01-11 3
1940 2017-01-09 2017-01-14 3
1940 2017-01-09 2017-01-28 3
949 2018-04-19 2018-04-23 2
949 2018-04-19 2018-04-26 2
and the expected result would be:
Car ID Date_depart Date_arrive Count
949 2017-01-01 2017-01-05 2
1940 2017-01-09 2017-01-11 3
949 2018-04-19 2018-04-23 2
Any help?
You need DISTINCT ON:
SELECT DISTINCT ON (date_depart, car_id)
*
FROM
trips
ORDER BY date_depart, car_id, date_arrive
This gives you the first (ordered) row of each group (date_depart, car_id).
demo: db<>fiddle
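To illustrate what DISTINCT ON does with the sample data from the question, here is a rough Python emulation. It only mirrors the semantics (sort, then keep the first row per group) and is not how PostgreSQL executes the query; the tuple layout is my own choice.

```python
# Sample rows from the question: (car_id, date_depart, date_arrive, count)
trips = [
    (949, "2017-01-01", "2017-01-05", 2),
    (949, "2017-01-01", "2017-01-09", 2),
    (1940, "2017-01-09", "2017-01-11", 3),
    (1940, "2017-01-09", "2017-01-14", 3),
    (1940, "2017-01-09", "2017-01-28", 3),
    (949, "2018-04-19", "2018-04-23", 2),
    (949, "2018-04-19", "2018-04-26", 2),
]

# ORDER BY date_depart, car_id, date_arrive (ISO dates sort correctly as text)
ordered = sorted(trips, key=lambda t: (t[1], t[0], t[2]))

# DISTINCT ON (date_depart, car_id): keep only the first row of each group.
first_stops = {}
for row in ordered:
    first_stops.setdefault((row[1], row[0]), row)

result = list(first_stops.values())
for row in result:
    print(row)
```

The three rows printed match the expected result in the question: one row per (date_depart, car_id) pair, with the earliest date_arrive.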
I am new to KDB and cannot understand why I can access the order column for the table stock but not for trader. Below is my code with the error.
q)trader
item brand | price order
---------------| -----------
soda fry | 1.5 200
bacon prok | 1.99 180
mushroom veggie| 0.88 110
eggs veggie| 1.55 210
tomatoes veggie| 1.35 100
q)trader.order
'order
[0] trader.order
^
q)stock.order
50 82 45 92
q)stock
item brand price order
---------------------------
soda fry 1.5 50
bacon prok 1.99 82
mushroom veggie 0.88 45
eggs veggie 1.55 92
q)trader.order
'order
[0] trader.order
^
Your table trader is keyed, so you cannot use trader.order to select the order column.
You can use this instead if you want:
(0!trader)`order
The reason is that trader.order is actually an indexing operation, the same as list.index. A table is just a list of dictionaries, and you use dot (.) to index into it. A keyed table, however, does not have the same structure, so you have to unkey it first.
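As a loose analogy only, the difference can be modelled in Python. This is a hypothetical model, not q's actual internals: the unkeyed table is represented as a mapping from column names to columns, the keyed table as a mapping from key rows to value rows, and unkey is a made-up helper standing in for 0!.

```python
# Hypothetical Python model of the two q tables; q's real internals differ.
stock = {  # unkeyed table: column name -> column values
    "item": ["soda", "bacon", "mushroom", "eggs"],
    "brand": ["fry", "prok", "veggie", "veggie"],
    "price": [1.5, 1.99, 0.88, 1.55],
    "order": [50, 82, 45, 92],
}

trader = {  # keyed table: key row (item, brand) -> value row
    ("soda", "fry"): {"price": 1.5, "order": 200},
    ("bacon", "prok"): {"price": 1.99, "order": 180},
    ("mushroom", "veggie"): {"price": 0.88, "order": 110},
    ("eggs", "veggie"): {"price": 1.55, "order": 210},
    ("tomatoes", "veggie"): {"price": 1.35, "order": 100},
}

def unkey(keyed, key_columns):
    """Rough analogue of 0!: merge key rows and value rows into plain columns."""
    columns = {name: [] for name in key_columns}
    for key, values in keyed.items():
        for name, part in zip(key_columns, key):
            columns[name].append(part)
        for name, value in values.items():
            columns.setdefault(name, []).append(value)
    return columns

print(stock["order"])            # indexing by column name works
try:
    trader["order"]              # looked up among the key rows, not the columns
except KeyError as err:
    print("KeyError:", err)

unkeyed_orders = unkey(trader, ["item", "brand"])["order"]
print(unkeyed_orders)
```

In this model, indexing the keyed table with a column name fails because the lookup goes through the keys, and it succeeds once the table has been flattened back into plain columns.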
In PostgreSQL I am looking for an answer to the following problem.
There are two columns providing data about 'start' and 'end', together with a 'date' column. Currently a date can appear in a single row with both 'start' and 'end' filled.
I am looking for a way to get 'start' and 'end' columns where each value appears on its own row, duplicating the date where necessary.
current:
id date start end
1 2017-03-13 a [null]
2 2017-03-14 [null] a
3 2017-03-14 b [null]
4 2017-03-16 [null] b
5 2017-03-16 c c
wish:
id date start end
1 2017-03-13 a [null]
2 2017-03-14 [null] a
3 2017-03-14 b [null]
4 2017-03-16 [null] b
5 2017-03-16 c [null]
6 2017-03-16 [null] c
Does anyone have an idea?
If I understood your problem correctly, and you want exactly one of start and "end" to be set, and the combination with date unique, you can do this:
ALTER TABLE tab
ADD CHECK(start IS NULL AND "end" IS NOT NULL
OR start IS NOT NULL AND "end" IS NULL);
CREATE UNIQUE INDEX ON tab (date, COALESCE(start, "end"));
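To see how the two constraints behave, here is a small sketch using Python's built-in sqlite3 module rather than PostgreSQL, so details may differ. The table definition is my assumption based on the sample data, and the index name tab_date_val_uq is made up (SQLite requires one).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tab (
        id      INTEGER PRIMARY KEY,
        date    TEXT NOT NULL,
        start   TEXT,
        "end"   TEXT,
        CHECK (start IS NULL AND "end" IS NOT NULL
            OR start IS NOT NULL AND "end" IS NULL)
    )
""")
# SQLite requires an index name; tab_date_val_uq is a hypothetical one.
conn.execute(
    'CREATE UNIQUE INDEX tab_date_val_uq ON tab (date, COALESCE(start, "end"))'
)

def try_insert(date, start, end):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            'INSERT INTO tab (date, start, "end") VALUES (?, ?, ?)',
            (date, start, end),
        )
        return True
    except sqlite3.IntegrityError:
        return False

attempts = [
    ("2017-03-13", "a", None),   # only start set: accepted
    ("2017-03-14", None, "a"),   # only "end" set: accepted
    ("2017-03-16", "c", "c"),    # both set: rejected by the CHECK
    ("2017-03-13", "a", None),   # same date and value again: rejected by the index
]
results = [try_insert(*row) for row in attempts]
print(results)
```

Note that under this index a row with start = 'c' and a row with "end" = 'c' on the same date count as duplicates, because COALESCE maps both to 'c'.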