Postgres extended statistics with partitioning

I am using Postgres 13 and have created a table with columns A, B and C. The table is partitioned by A, which has 2 possible values. Partition 1 contains 100 possible values each for B and C, whereas partition 2 has 100 completely different values for B and 1 different value for C. I have set the statistics target for both columns to the maximum so that this definitely isn't the cause of any issue.
If I group by B and C on either partition, Postgres estimates the number of groups correctly. However, if I run the query against the base table, where I really want it, it estimates what I assume is no functional dependency between A, B and C, i.e. (p1B + p2B) * (p1C + p2C) = 200 * 101, as opposed to the reality of p1B * p1C + p2B * p2C = 10000 + 100.
I guess I was half expecting it to sum the underlying partitions rather than use the full count of 200 B's and 101 C's that the base table can see. Moreover, if I also add A into the group by then the estimate erroneously doubles further still, as it then thinks that this set will also be duplicated for each value of A.
This all made me think that I need an extended statistic to tell it that A influences either B or C or both. However, if I create one on the base (partitioned) table and analyze, the value in pg_statistic_ext_data.stxdndistinct is null, whereas if I create it on the partitions themselves, it does appear to work, though that isn't particularly useful because the estimation is already correct at this level. How do I go about having Postgres estimate against the base table correctly, without having to run the query against all of the partitions and union them together?

You can define extended statistics on a partitioned table, but PostgreSQL doesn't collect any data in that case. You'll have to create extended statistics on all partitions individually.
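For example (a minimal sketch; t_p1 and t_p2 are hypothetical names for the two partitions, and b and c are the columns from the question):
-- ndistinct statistics per partition; the partitioned parent itself collects no data
CREATE STATISTICS t_p1_bc (ndistinct) ON b, c FROM t_p1;
CREATE STATISTICS t_p2_bc (ndistinct) ON b, c FROM t_p2;
ANALYZE t_p1, t_p2;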
You can confirm that by querying the collected data after an ANALYZE:
SELECT s.stxrelid::regclass AS table_name,
s.stxname AS statistics_name,
d.stxdndistinct AS ndistinct,
d.stxddependencies AS dependencies
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d
ON d.stxoid = s.oid;
There is certainly room for improvement here; perhaps don't allow defining extended statistics on a partitioned table in the first place.

I found that I just had to turn enable_partitionwise_aggregate on to get this to estimate correctly.
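For reference, a minimal sketch (the setting is off by default; t stands in for the partitioned parent table):
SET enable_partitionwise_aggregate = on;
-- With this enabled, the aggregation can be planned per partition, so the
-- per-partition statistics drive the group-count estimates.
EXPLAIN SELECT b, c, count(*) FROM t GROUP BY b, c;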

Related

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for an HDB?
I tried select[1] from t where date=2021.02.25 but it gives this error:
Not yet implemented: it probably makes sense, but it’s not defined nor implemented, and needs more thinking about as the language evolves
select[n] syntax works only if the table is already loaded into memory.
The easiest way to get the 1st row of an HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied on already loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to map constraints in first before pulling a subset, this is a reasonable solution.
It will however pull N rows for each date partition selected due to the way that q queries work under the covers. So YMMV and this might not be the best solution if it was behind an API for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sym=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
Per your first request (which is different to above) select[1] from t, you can achieve that with
.Q.ind[t;enlist 0]

How to implement application level pagination over ScalarDB

This question is part Cassandra and part ScalarDB. I am using ScalarDB, which provides ACID support on top of Cassandra. The library seems to be working well! Unfortunately, ScalarDB doesn't support pagination, so I have to implement it in the application.
Consider this scenario in which P is primary key, C is clustering key and E is other data within the partition
Partition => { P,C1,E1
P,C2,E1
P,C2,E2
P,C2,E3
P,C2,E4
P,C3,E1
...
P,Cm,En
}
In ScalarDB, I can specify start and end values of keys, so I suppose ScalarDB will get data only from the specified rows. I can also limit the number of entries fetched.
https://scalar-labs.github.io/scalardb/javadoc/com/scalar/db/api/Scan.html
Say I want to get entries E3 and E4 from P,C2. For smaller values, I can specify start and end clustering keys as C2 and set fetch limit to say 4 and ignore E1 and E2. But if there are several hundred records then this method will not scale.
For example say P,C1 has 10 records, P,C2 has 100 records and I want to implement pagination of 20 records per query. Then to implement this, I'll have to
Query 1 – Scan – primary key will be P, clustering start will be C1, clustering end will be Cn as I don’t know how many records are there.
get P,C1. This will give 10 records
get P,C2. This will give me 20 records. I'll ignore last 10 and combine P,C1's 10 with P,C2's first 10 and return the result.
I'll also have to maintain that the last cluster key queried was C2 and also that 10 records were fetched from it.
Query 2 (for next pagination request) - Scan – primary key will be P, clustering start will be C2, clustering end will be Cn as I don’t know how many records are there.
Now I'll fetch P,C2 and get 20, ignore 1st 10 (as they were sent last time), take the remaining 10, do another fetch using same Scan and take first 10 from that.
Is this how it should be done or is there a better way? My concern with above implementation is that every time I'll have to fetch loads of records and dump them. For example, say I want to get records 70-90 from P,C2 then I'll still query up to record 60 and dump the result!
The partition key and the clustering keys together compose the primary key, so your example above doesn't look right.
Let's assume the following data structure.
P, C1, ...
P, C2, ...
P, C3, ...
...
Anyway, I think one way to do it could be as follows, assuming the page size is 2.
Scan with start (P, C1) inclusive, ascending and with limit 2. Results stored in R1
Get the last record of R1 -> (P, C2).
Scan with start the previous last record (P, C2) not inclusive, ascending with limit 2.
...

Spark union of dataframes does not give counts?

I am trying to union these dataframes. I used G_ID is not null or MCOM.T_ID is not null and used trim, but the count does not come back; it has been running for 1 hour and there are only 3 tasks remaining out of 300. Please suggest how I can debug this. Is a null causing the issue, and how can I debug it?
val table1 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.CIDS WHERE
_UPDT_TM >= '2020-02-01 15:14:39.527' """)
val table2 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.MIDS MCOM INNER
JOIN ab.VD_MBR VDBR
ON Trim(MCOM.T_ID) = Trim(VDBR.T_ID) AND Trim(MCOM.G_ID) = Trim(VDBR.G_ID)
AND Trim(MCOM.C123M_CD) IN ('BBB', 'AAA') WHERE MCOM._UPDT_TM >= '2020-02-01 15:14:39.527'
AND Trim(VDBR.BB_CD) IN ('BBC') """)
var abc=table1.select("PC_ID").union(table2.select("PC_ID"))
even tried this --> filtered = abc.filter(row => !row.anyNull);
It looks like you have a data skew problem. Looking at the "Summary Metrics" it's clear that (at least) three quarters of your partitions are empty, so you are eliminating most of the potential parallelization that spark can provide for you.
Though it will cause a shuffle step (where data gets moved over the network between different executors), a .repartition() will help to balance the data across all of the partitions and create more valid units of work to be spread among the available cores. This would most likely provide a speedup of your count().
As a rule of thumb, you'd likely want to call .repartition() with the parameter set to at least the number of cores in your cluster. Setting it higher will result in tasks getting completed more quickly (it's fun to watch the progress), though it adds some management overhead to the overall time the job will take to run. If the tasks are too small (i.e. not enough data per partition), then sometimes the scheduler gets confused and won't use the entire cluster either. On the whole, finding the right number of partitions is a balancing act.
You have aliased the column "C_ID" as "PC_ID", and after that you are looking for "C_ID".
Also, a union can only be performed on tables with the same number of columns; your table1 and table2 differ in column count.
If they don't match, you will get: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns
Please take care of these two scenarios first.

Performance issue in combined where clauses

Question
I would like to know: How can I rewrite/alter my search query/strategy to get an acceptable performance for my end users?
The search
I'm implementing a search for our users; they are provided the ability to search for candidates on our system based on:
A professional group they fall into,
A location + radius,
A full text search.
The query
select v.id
from (
select
c.id,
c.ts_description,
c.latitude,
c.longitude,
g.group
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
) v
-- Group selection
where v.group = 'medical'
-- Location + radius
and earth_distance(ll_to_earth(v.latitude, v.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
-- Full text search
and v.ts_description @@ to_tsquery('simple', 'nurse | doctor')
;
Data size & benchmarks
I am working with 1.7 million records
I have the 3 conditions in order of impact which were benchmarked in isolation:
Group clause: 3s & reduces to 700k records
Location clause: 8s & reduces to 54k records
Full text clause: 60s+ & reduces to 10k records
When combined, they seem to take 71s, which is the full cost of the 3 clauses in isolation. My expectation was that when putting all 3 clauses together they would work sequentially, i.e. each on the subset of data left by the previous clause, so the timing should reduce dramatically - but this has not happened.
What I've tried
All join conditions & where clauses are indexed
Notably the ts_description index (GIN) is 2GB
lat/lng is indexed with ll_to_earth() to reduce the impact inline
I nested each where clause into a different subquery in order
Changed the order of all clauses & subqueries
Increased the shared_buffers size to increase the potential cache hits
It seems you do not need the subquery, and it is also good practice to filter on numeric fields; so, instead of filtering with where v.group = 'medical', for example, create a dictionary and just filter with where group = 1:
select DISTINCT c.id
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
where g.group = 1
and earth_distance(ll_to_earth(c.latitude, c.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
and c.ts_description @@ to_tsquery('simple', 'nurse | doctor')
Also, use EXPLAIN ANALYZE to check your execution plan; these quick tips should help you improve it.
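For example, to see how the heaviest clause (the full text search) is actually being executed:
-- Prefix the query with EXPLAIN (ANALYZE, BUFFERS) to get the real plan, row counts and buffer hits
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.id
FROM entities.candidates c
WHERE c.ts_description @@ to_tsquery('simple', 'nurse | doctor');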
There were some best-practice cases that I had not considered; I have subsequently implemented these to gain a substantial performance increase:
tsvector Index Size Reduction
I was storing up to 25,000 characters in the tsvector, which meant that when more complicated full text search queries were used there was just an immense amount of work to do. I reduced this down to 10,000 characters, which has made a big difference, and for my use case this is an acceptable trade-off.
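A sketch of how that truncation might look when rebuilding the tsvector, assuming the raw text lives in a hypothetical description column:
-- Build the tsvector from at most the first 10,000 characters of the source text
UPDATE entities.candidates
SET ts_description = to_tsvector('simple', left(description, 10000));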
Create a Materialised View
I created a materialised view that contains the join, which offloads a little bit of the work. Additionally, I built my indexes on it and run a concurrent refresh on a 2-hour interval. This gives me a pretty stable table to work with.
Even though my search yields 10k records, I end up paginating on the front-end, so I only ever bring back up to 100 results to the screen; this allows me to join onto the original table for only the 100 records I'm going to send back.
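Roughly, that setup could look like this (candidate_search is an illustrative name; the columns come from the query above):
-- Precompute the join once and index the result
CREATE MATERIALIZED VIEW candidate_search AS
SELECT c.id, c.ts_description, c.latitude, c.longitude, g.group AS group_name
FROM entities.candidates c
JOIN entities.candidates_connections cc ON cc.candidates_id = c.id
JOIN system.groups g ON cc.systems_id = g.id;
-- REFRESH ... CONCURRENTLY requires a unique index (adjust the columns so rows are unique)
CREATE UNIQUE INDEX ON candidate_search (id, group_name);
CREATE INDEX ON candidate_search USING gin (ts_description);
-- Run on a schedule (e.g. every 2 hours) without blocking readers
REFRESH MATERIALIZED VIEW CONCURRENTLY candidate_search;
The search query then targets candidate_search directly and only joins back to entities.candidates for the page of ids being returned.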
Increase RAM & utilise pg_prewarm
I increased the server RAM to give me enough space to hold my materialised view, then ran pg_prewarm on the materialised view. Keeping it in memory yielded the biggest performance increase for me, bringing a 2-minute query down to 3s.
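For reference, the prewarm step itself is small (using the illustrative candidate_search view from above):
-- pg_prewarm is a contrib extension shipped with PostgreSQL
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Load the materialised view (and, if desired, its indexes) into the buffer cache
SELECT pg_prewarm('candidate_search');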

array_agg guaranteed consistent across multiple columns in Postgres?

Suppose I have the following table in Postgres 9.4:
a | b
---+---
1 | 2
3 | 1
2 | 3
1 | 1
If I run
select array_agg(a) as a_agg, array_agg(b) as b_agg from foo
I get what I want
a_agg | b_agg
-----------+-----------
{1,3,2,1} | {2,1,3,1}
The orderings of the two arrays are consistent: the first element of each comes from a single row, as does the second, as does the third. I don't actually care about the order of the arrays, only that they be consistent across columns.
It seems natural that this would "just happen", and it seems to. But is it reliable? Generally, the ordering of SQL things is undefined unless an ORDER BY clause is specified. It is perfectly possible to get postgres to generate inconsistent pairings with inconsistent ORDER BY clauses within array_agg (with some explicitly counterproductive extra work):
select array_agg(a order by b) as agg_a, array_agg(b order by a) as agg_b from foo;
yields
agg_a | agg_b
-----------+-----------
{3,1,1,2} | {2,1,3,1}
This is no longer consistent. The first array elements 3 and 2 did not come from the same original row.
I'd like to be certain that, without any ORDER BY clause, the natural thing just happens. Even with an ordering on either column, ambiguity would remain because of the duplicate elements. I'd prefer to avoid imposing an unambiguous sort, because in my real application, the tables will be large and the sorting might be costly. But I can't find any documentation that guarantees or specifies that, absent imposition of inconsistent orderings, multiple array_agg calls will be ordered consistently, even though it'd be very surprising if they weren't.
Is it safe to assume that the ordering of multiple array_agg columns will be consistently ordered when no ordering is explicitly imposed on the query or within the aggregate functions?
According to the PostgreSQL documentation:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. [...]
However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering.
The way I understand it: you can't be sure that the order of rows is preserved unless you use ORDER BY.
It seems there is a similar (or almost same) question here:
PostgreSQL array_agg order
I prefer ebk's answer there:
So I think it's fine to assume that all the aggregates, none of which uses ORDER BY, in your query will see input data in the same order. The order itself is unspecified though (which depends on the order the FROM clause supplies rows).
But you can still add the same ORDER BY inside each array_agg call to force a consistent order.
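For example, against the foo table from the question, giving both aggregates the same sort keys keeps the pairing explicit:
-- Identical ORDER BY in each aggregate guarantees the arrays are paired row by row
SELECT array_agg(a ORDER BY a, b) AS a_agg,
       array_agg(b ORDER BY a, b) AS b_agg
FROM foo;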