PostgreSQL index-only scan speed

So I've got an index-only scan returning 750k rows, and pulling it into a CTE and doing count(*) takes 0.5 seconds. It's barely using any IOPS, and maxing out the instance to a 16xlarge isn't moving the needle. I switched to a bitmap heap scan and it still takes 0.5 seconds. What alternatives (other than a materialized view) can I try to speed it up? Or is this just Postgres v10 at its finest?

Related

PostgreSQL default_statistics_target not improving row estimation

I am trying to optimize our queries on Postgres, which sometimes take minutes on huge tables. I started looking at the query plan and noticed a nearly 1000x difference between the estimated and actual number of rows when running with EXPLAIN ANALYZE.
This led me to the parameter default_statistics_target, which controls the number of rows sampled by the ANALYZE command to collect the stats used by the query planner. As a few blogs suggested, I experimented by increasing the value, setting it to 1000 and even to the maximum allowed value of 10000.
I ran ANALYZE every time to ensure the stats were updated. But surprisingly, this did not improve the row estimation at all. In fact, it reduced the estimated value a bit further, which is hard to understand.
I also tested reducing the value to 10, which seems to have improved the estimate a bit. So I am confused whether the parameter actually does what I thought it does, or whether there is some other way to improve row estimation. Any help would be much appreciated.
Postgres version: 9.6
Query plan: At the last index scan step, it has estimated 462 but actual is 1.9M.
https://explain.depesz.com/s/GZY
After changing default_statistics_target = 1000, rows at Index scan step were
-> (cost=0.57..120.42 rows=114 width=32) (actual time=248.999..157947.395 rows=1930518 loops=1)
And on setting it to default_statistics_target = 10, counts were:
-> (cost=0.57..2610.79 rows=2527 width=32) (actual time=390.437..62668.837 rows=1930518 loops=1)
P.S. Table under consideration has more than 100M rows.
This looks like a correlation problem. The planner assumes that the conditions on project_id, event_name_id, and "timestamp" are independent and so multiplies their estimated selectivities. If they are not independent, then no amount of traditional per-column statistics is going to help. Maybe you need extended statistics (CREATE STATISTICS, available from PostgreSQL 10 onward, so this would mean upgrading from 9.6).
Also, at the time it makes the plan it doesn't even know what value event_name_id will be compared to, as $0 is not determined until run time, so it can't use the value-specific statistics for that. You could execute the subquery manually, then hard code the resulting value into that spot in the query, so the planner knows what the value is while it is planning.
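To make the independence assumption concrete, here is a rough sketch of the arithmetic. The three selectivities are made-up numbers, not values taken from the plan in the question, and the two functions are only a simplified model of the estimate, not the planner's actual code. Only the table size ("more than 100M rows") comes from the question.

```python
from math import prod

# Table size from the question ("more than 100M rows"), rounded down.
TOTAL_ROWS = 100_000_000

# HYPOTHETICAL per-condition selectivities, for illustration only.
selectivities = {"project_id": 0.05, "event_name_id": 0.02, "timestamp": 0.01}


def rows_if_independent(total_rows, sels):
    """Estimate when the conditions are assumed independent:
    the selectivities are simply multiplied together."""
    return total_rows * prod(sels)


def rows_if_fully_correlated(total_rows, sels):
    """Opposite extreme: every condition selects the same rows,
    so only the most selective condition matters."""
    return total_rows * min(sels)


independent = rows_if_independent(TOTAL_ROWS, selectivities.values())
correlated = rows_if_fully_correlated(TOTAL_ROWS, selectivities.values())

print(f"independent estimate: {independent:,.0f} rows")
print(f"fully correlated:     {correlated:,.0f} rows")
print(f"underestimate factor: {correlated / independent:,.0f}x")
```

With these invented numbers the multiplied estimate comes out around a thousand times too low, and a bigger ANALYZE sample does not change that, because the error is in how the per-column estimates are combined rather than in the per-column estimates themselves.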

Speed up ST_snap query in PostGIS

Explanation
I have 2 tables in PostgreSQL using the PostGIS extension. Both tables represent the streets of a province as linestrings.
streetsA table (orange lines) has a table size of 96 MB (471026 rows), the second table streetsB (green lines) has a storage size of 78 MB (139708 rows). The streets differ a bit in their positions, that is why I applied a ST_Snap function to match streetsB to streetsA.
create table snapped as select ST_snap(a.geom, b.geom, ST_Distance(a.geom, b.geom)*0.5) from streetsA as a, streetsB as b;
However, due to the large size of the tables, the query takes more than 5 hours to complete. I haven't changed anything in the Postgres settings. Is it a good idea to run the query on such a large dataset? Does a spatial index make sense for this query? I am using a laptop with 16 GB of RAM and a Core i7.
EXPLAIN gives me the following output:
Nested Loop (cost=0.00..5264516749.25 rows=65806100408 width=32)
Seq Scan on streetsa a (cost=0.00..16938.26 rows=471026 width=153)
Materialize (cost=0.00..12127.62 rows=139708 width=206)
Seq Scan on streets b (cost=0.00..11429.08 rows=139708
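The rows= figure on the Nested Loop line can be reproduced by hand, which shows what the query is really asking for: every street in streetsA paired with every street in streetsB, because the join has no condition. A small sketch of that arithmetic follows. The row counts are from the question; the "candidates per street" figure is a purely hypothetical number used to show the scale of the difference if the join were restricted to nearby pairs (for example with a distance predicate such as ST_DWithin, which is the kind of condition a spatial index can serve).

```python
# Row counts as reported in the question.
ROWS_STREETS_A = 471_026
ROWS_STREETS_B = 139_708

# An unconditioned join pairs every row of A with every row of B.
cross_join_pairs = ROWS_STREETS_A * ROWS_STREETS_B

# HYPOTHETICAL: suppose each street in A had on average only this many
# streets in B close enough to be worth snapping to.
ASSUMED_CANDIDATES_PER_STREET = 5
nearby_pairs = ROWS_STREETS_A * ASSUMED_CANDIDATES_PER_STREET

print(f"cross join pairs: {cross_join_pairs:,}")
print(f"nearby pairs:     {nearby_pairs:,}")
print(f"reduction factor: {cross_join_pairs / nearby_pairs:,.0f}x")
```

The first number matches the rows=65806100408 estimate in the plan, so ST_Snap and ST_Distance are each being evaluated on the order of 65 billion times. An index alone would not change that, since without a join condition there is nothing for it to be used on.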

Slow indexing of 300GB Postgis table

I am loading about 300GB of contour line data into a PostGIS table. To speed up the process, I read that it is fastest to first load the data and then create an index. Loading the data only took about 2 days, but now I have been waiting for the index for about 30 days, and it is still not ready.
The query was:
create index idx_contour_geom on contour.contour using gist(geom);
I ran it in pgAdmin 4, and the memory consumption of the program has varied from 500MB to over 100GB since.
Is it normal for indexing such a table to take this long?
Any tips on how to speed up the process?
Edit:
The data is loaded from 1x1 degree (lat/lon) cells (about 30,000 cells), so no line has a bounding box larger than 1x1 degree, and most of them should be much smaller. They are in EPSG:4326 and the only attributes are height and the geometry (geom).
I changed maintenance_work_mem to 1GB and stopped all other writing to disk (a lot of insert operations had ANALYZE appended, which took a lot of resources). The index build then ran in 23 minutes.

Pyaudio/Portaudio issue in synchronizing read/write buffers

My sampler (Sound Blaster) runs at 96000 Hz.
I need to synchronize the output signal with the input signal, so the time_info is critical for us. We need to know the exact time difference between the first sample of the output buffer and the first sample of the input buffer.
However, the problem is that the time_info gap between two consecutive write or read buffers (callbacks) is not constant. For example:
My buffer is 19200 samples for both read and write, and when I print the gap on each callback I get:
Time gap between READ buffers: 0.19993333332968177274
Time gap between WRITE buffers: 0.20000299999810522422
Time gap between READ buffers: 0.20001300000149058178
Time gap between WRITE buffers: 0.20000299999810522422
Time gap between READ buffers: 0.20000900000013643876
Time gap between WRITE buffers: 0.20000099999742815271
Time gap between READ buffers: 0.19996774999162880704
Time gap between WRITE buffers: 0.19999800000368850306
19200 / 96000 should always be 0.2 seconds, but I get different times in the time_info, which eliminates my ability to sync the output with the input.
I am working with a 40 kHz sound wave, so in order to sync the phase I need the times to be accurate to within 1 microsecond.
Is this a problem in PortAudio? Is this a problem in my sound card? Do these time_info numbers come from the sound card (the hardware) or from PortAudio?
I am using PyAudio (PortAudio binding for Python) in Ubuntu.
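To put numbers on the jitter, here is a small sketch that takes the gaps printed above and compares them with the nominal buffer duration and with the period of the 40 kHz signal. It only quantifies the figures from the question; it says nothing about whether they originate in PortAudio, the driver, or the hardware. The function name is made up for this example.

```python
SAMPLE_RATE_HZ = 96_000
BUFFER_SAMPLES = 19_200
SIGNAL_HZ = 40_000

# Gaps between consecutive callbacks, copied from the output above.
read_gaps = [
    0.19993333332968177274,
    0.20001300000149058178,
    0.20000900000013643876,
    0.19996774999162880704,
]
write_gaps = [
    0.20000299999810522422,
    0.20000299999810522422,
    0.20000099999742815271,
    0.19999800000368850306,
]

nominal_gap = BUFFER_SAMPLES / SAMPLE_RATE_HZ  # 0.2 s
signal_period_us = 1e6 / SIGNAL_HZ             # 25 microseconds


def max_abs_deviation_us(gaps, nominal):
    """Largest absolute deviation from the nominal gap, in microseconds."""
    return max(abs(g - nominal) for g in gaps) * 1e6


read_jitter_us = max_abs_deviation_us(read_gaps, nominal_gap)
write_jitter_us = max_abs_deviation_us(write_gaps, nominal_gap)

print(f"nominal gap:      {nominal_gap:.6f} s")
print(f"signal period:    {signal_period_us:.1f} us")
print(f"max read jitter:  {read_jitter_us:.2f} us")
print(f"max write jitter: {write_jitter_us:.2f} us")
```

On these samples the read timestamps wander by more than a full 25 microsecond period of the 40 kHz wave, and the write timestamps by a few microseconds, so neither is within the 1 microsecond target as reported.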

Understanding simple PostgreSQL EXPLAIN

I can't understand the EXPLAIN output of a quite simple query:
select * from store order by id desc limit 1;
QUERY PLAN
Limit (cost=0.00..0.03 rows=1 width=31)
-> Index Scan Backward using store_pkey on store (cost=0.00..8593.28 rows=250000 width=31)
Why does the top-level node (Limit) have a lower cost than the nested node (Index Scan)? As I read the documentation, the cost should be cumulative, i.e. 8593.28 + 0.03.
The docs (emphasis added) say:
Actually two numbers are shown: the start-up cost before the first row can be returned, and the total cost to return all the rows. For most queries the total cost is what matters, but in contexts such as a subquery in EXISTS, the planner will choose the smallest start-up cost instead of the smallest total cost (since the executor will stop after getting one row, anyway).
In other words, 8593.28 would be the cost to return all the rows, but due to the limit you're only returning one, so the actual cost is much lower (more or less equal to the startup cost).
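The 0.03 on the Limit node is consistent with simply prorating the child's run cost by the fraction of rows that will actually be fetched. Here is a sketch of that arithmetic using the numbers from the plan above. The function is a simplified model written for this answer, not the planner's source code.

```python
def limit_total_cost(child_startup, child_total, child_rows, limit_rows):
    """Simplified model: charge the child's start-up cost in full, then
    only the share of its run cost needed to produce `limit_rows` rows."""
    fraction = min(limit_rows, child_rows) / child_rows
    return child_startup + (child_total - child_startup) * fraction


# Numbers from the Index Scan Backward node in the plan above.
cost = limit_total_cost(
    child_startup=0.00,
    child_total=8593.28,
    child_rows=250_000,
    limit_rows=1,
)

print(f"estimated Limit total cost: {cost:.2f}")
```

So the parent is not "child cost plus its own work" here: a Limit node is expected to stop pulling rows from its child early, and its estimate reflects only the part of the index scan it expects to run.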
The numbers you see in the top node (0.00..0.03) are (per documentation)
0.00 .. the estimated start-up cost of the node
0.03 .. the estimated total cost of the node
If you want actual times, run EXPLAIN ANALYZE, which executes the query and appends the actual times for every node. For example:
Limit (cost=0.29..0.31 rows=1 width=30) (actual time=xxx.xxx..xxx.xxx rows=1 loops=1)
Bold emphasis mine.