Why doesn't "((left union right) union other)" behave associatively? - scala

The code in the following gist is lifted almost verbatim out of a lecture in Martin Odersky's Functional Programming Principles in Scala course on Coursera:
https://gist.github.com/aisrael/7019350
The issue occurs in line 38, within the definition of union in class NonEmpty:
def union(other: IntSet): IntSet =
  // The following expression doesn't behave associatively
  ((left union right) union other) incl elem
With the given expression, ((left union right) union other), largeSet.union(Empty) takes an inordinate amount of time to complete for sets of 100 elements or more.
When that expression is changed to (left union (right union other)), then the union operation finishes relatively instantly.
ADDED: Here's an updated worksheet that shows how even with larger sets/trees with random elements, the expression ((left ∪ right) ∪ other) can take forever but (left ∪ (right ∪ other)) will finish instantly.
https://gist.github.com/aisrael/7020867

The answer to your question is very much connected to relational databases and the smart choices they make. When a database "unions" tables, a smart controller system will make decisions such as "How large is table A? Would it make more sense to join A & B first, or A & C?" when the user writes:
A Join B Join C
Anyhow, you can't expect the same behavior when you are writing the code by hand, because you have specified the exact order you want using parentheses. None of those smart decisions can happen automatically. (Though in theory they could, and that's why Oracle, Teradata, and MySQL exist.)
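You can actually watch a database make that choice. A quick sketch (the tables a, b, c here are hypothetical): however you write the joins, EXPLAIN shows the order the planner actually picked.
EXPLAIN
SELECT *
FROM a
JOIN b ON b.a_id = a.id
JOIN c ON c.b_id = b.id;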
Consider a ridiculously large example:
Set A - 1 Billion Records
Set B - 500 Million Records
Set C - 10 Records
For argument's sake, assume that the union operation takes O(N) time, where N is the size of the SMALLEST of the two sets being combined. This is reasonable: each key in the smaller set can be looked up in the larger one as a hashed retrieval:
A & B runtime = O(N) = 500 million
(let's assume the class is just smart enough to use the smaller of the two for lookups)
so
(A & B) & C
Results in:
O(N) 500 million + O(N) 10 = 500,000,010 comparisons
Again, this points to the fact that it was forced to compare 1 billion records to 500 million records FIRST, per the inner parentheses, and then pull in 10 more.
But consider this:
A & (B & C)
Well now something amazing happens:
(B & C) runtime O(N) = 10 record comparisons (each of the 10 C records is checked against B for existence)
then
A & (result) = O(N) = 10
Total = 20 comparisons
Notice that once (B & C) was completed, we only had to bump 10 records against 1 billion!
Both examples produce the exact same result; one in about 20 comparisons, the other in 500,000,010!
To summarize, this problem illustrates in just a small way some of the complex thinking that goes into database design and the smart optimization that happens in that software. These things do not happen automatically in programming languages unless you have coded them that way, or are using a library of some sort. You could, for example, write a function that takes several sets and intelligently decides the union order. But the issue becomes unbelievably complex if other set operations have to be mixed in. Hope this helps.

Associativity is not about performance. Two expressions may be equivalent by associativity but one may be vastly harder than the other to actually compute:
(23 * (14/2)) * (1/7)
Is the same as
23 * ((14/2) * (1/7))
But if it were me evaluating the two, I'd reach the answer (23) in a jiffy with the second one, but take longer if I forced myself to work with just the first one. The same thing happens with your trees: in ((left union right) union other), the outer union has to take the freshly built (left union right) tree apart all over again, so large intermediate trees are rebuilt and re-traversed at every level of the recursion, and the work blows up combinatorially with the size of the set. In (left union (right union other)), each recursive call unions a strictly smaller subtree into an accumulated result, so every element ends up being added by a single incl.

Related

Postgres extended statistics with partitioning

I am using Postgres 13 and have created a table with columns A, B and C. The table is partitioned by A with 2 possible values. Partition 1 contains 100 possible values each for B and C, whereas partition 2 has 100 completely different values for B, and 1 different value for C. I have set the statistics for both columns to the maximum so that this definitely doesn't cause any issue.
If I group by B and C on either partition, Postgres estimates the number of groups correctly. However, if I run the query against the base table, where I really want it, it estimates what I assume is no functional dependency between A, B and C, i.e. (p1B + p2B) * (p1C + p2C) = 200 * 101, as opposed to the reality of p1B * p1C + p2B * p2C = 10000 + 100.
I guess I was half expecting it to sum the underlying partitions rather than use the full count of 200 Bs and 101 Cs that the base table can see. Moreover, if I also add A into the group by, the estimate erroneously doubles further still, as it then thinks that this set will also be duplicated for each value of A.
This all made me think that I need an extended statistic to tell it that A influences either B or C or both. However, if I create one on the partitioned base table and analyze, the value in pg_statistic_ext_data->stxdndistinct is null. Whereas if I create it on the partitions themselves, this does appear to work, though it isn't particularly useful because the estimation is already correct at that level. How do I get Postgres to estimate against the base table correctly without having to run the query against all of the partitions and unioning them together?
You can define extended statistics on a partitioned table, but PostgreSQL doesn't collect any data in that case. You'll have to create extended statistics on all partitions individually.
You can confirm that by querying the collected data after an ANALYZE:
SELECT s.stxrelid::regclass AS table_name,
       s.stxname AS statistics_name,
       d.stxdndistinct AS ndistinct,
       d.stxddependencies AS dependencies
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d ON d.stxoid = s.oid;
There is certainly room for improvement here; perhaps don't allow defining extended statistics on a partitioned table in the first place.
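A sketch of doing it per partition (the table and statistics names here are hypothetical; adjust to your schema):
CREATE STATISTICS p1_b_c (ndistinct) ON b, c FROM my_table_p1;
CREATE STATISTICS p2_b_c (ndistinct) ON b, c FROM my_table_p2;
ANALYZE my_table_p1;
ANALYZE my_table_p2;
After the ANALYZE, the query above should show populated stxdndistinct values for both partitions.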
I found that I just had to turn enable_partitionwise_aggregate on to get this to estimate correctly.
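For reference, that setting is off by default (it makes planning more expensive), so it has to be enabled explicitly, e.g. per session:
SET enable_partitionwise_aggregate = on;
With it on, the planner can aggregate each partition separately, where the statistics are accurate, and then combine the results.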

Performance issue in combined where clauses

Question
I would like to know: how can I rewrite/alter my search query/strategy to get acceptable performance for my end users?
The search
I'm implementing a search for our users; they are provided the ability to search for candidates on our system based on:
A professional group they fall into,
A location + radius,
A full text search.
The query
select v.id
from (
    select c.id,
           c.ts_description,
           c.latitude,
           c.longitude,
           g."group"
    from entities.candidates c
    join entities.candidates_connections cc on cc.candidates_id = c.id
    join system.groups g on cc.systems_id = g.id
) v
-- Group selection
where v."group" = 'medical'
-- Location + radius
and earth_distance(ll_to_earth(v.latitude, v.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
-- Full text search
and v.ts_description @@ to_tsquery('simple', 'nurse | doctor');
Data size & benchmarks
I am working with 1.7 million records
I have the 3 conditions listed in order of impact, each benchmarked in isolation:
Group clause: 3s & reduces to 700k records
Location clause: 8s & reduces to 54k records
Full text clause: 60s+ & reduces to 10k records
When combined they seem to take 71s, which is the full cost of the 3 clauses in isolation. My expectation was that the clauses would be applied sequentially, i.e. each on the subset of data left by the previous clause, so the timing should reduce dramatically, but this has not happened.
What I've tried
All join conditions & where clauses are indexed
Notably the ts_description index (GIN) is 2GB
lat/lng is indexed with ll_to_earth() to reduce the impact inline
I nested each where clause into a different subquery in order
Changed the order of all clauses & subqueries
Increased the shared_buffers size to increase the potential cache hits
It seems you do not need the subquery, and it is also good practice to filter on numeric fields, so instead of filtering with where v."group" = 'medical', for example, create a dictionary table and just filter on its numeric id, e.g. where g.id = 1:
select distinct c.id
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
where g.id = 1 -- numeric id standing in for 'medical'
and earth_distance(ll_to_earth(c.latitude, c.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
and c.ts_description @@ to_tsquery('simple', 'nurse | doctor')
Also, use EXPLAIN ANALYZE to inspect and check your execution plan. These quick tips should help you improve it.
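For example, with a cut-down version of the query from the question:
EXPLAIN (ANALYZE, BUFFERS)
select c.id
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
where g."group" = 'medical';
Comparing estimated versus actual row counts per plan node is the quickest way to see which clause the planner is mis-costing.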
There were some best-practice cases that I had not considered; I have subsequently implemented these to gain a substantial performance increase:
tsvector Index Size Reduction
I was storing up to 25,000 characters in the tsvector, which meant that more complicated full text search queries simply had an immense amount of work to do. I reduced this to 10,000 characters, which has made a big difference, and for my use case this is an acceptable trade-off.
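A sketch of the change (the raw description source column is an assumption; only its first 10,000 characters feed the tsvector):
UPDATE entities.candidates
SET ts_description = to_tsvector('simple', left(description, 10000));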
Create a Materialised View
I created a materialised view that contains the join; this offloads a little bit of the work. Additionally, I built my indexes on it and run a concurrent refresh on a 2-hour interval. This gives me a pretty stable table to work with.
Even though my search yields 10k records, I paginate on the front-end and only ever bring back up to 100 results to the screen, which allows me to join onto the original table for only the 100 records I'm going to send back.
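A rough sketch of that setup (the view and index names are made up; note that REFRESH ... CONCURRENTLY requires a unique index on the view):
CREATE MATERIALIZED VIEW candidates_search AS
SELECT c.id, c.ts_description, c.latitude, c.longitude, g."group"
FROM entities.candidates c
JOIN entities.candidates_connections cc ON cc.candidates_id = c.id
JOIN system.groups g ON cc.systems_id = g.id;

CREATE UNIQUE INDEX candidates_search_uq ON candidates_search (id, "group");
CREATE INDEX candidates_search_fts ON candidates_search USING gin (ts_description);

-- run on a schedule, e.g. every 2 hours:
REFRESH MATERIALIZED VIEW CONCURRENTLY candidates_search;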
Increase RAM & utilise pg_prewarm
I increased the server RAM to give me enough space to keep my materialised view in memory, then ran pg_prewarm on it. Keeping it in memory yielded the biggest performance increase for me, bringing a 2-minute query down to 3 seconds.
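pg_prewarm ships with PostgreSQL as a contrib extension; warming the view (name as in the sketch above) looks like:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('candidates_search');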

Merging two tree sets

I am trying to get the union of two sets. They are basically binary trees (but not guaranteed to be balanced). This is the code:
class MyNonEmptySet(elem: Int, left: MySet, right: MySet) extends MySet {
  def union(that: MySet): MySet =
    ((left union right) union that) incl elem
}

class MyEmptySet extends MySet {
  def union(that: MySet): MySet = that
}
For smaller data sets the union works fine, but when the data is a bit larger, union doesn't ever return; it just goes on. I want to understand what is going wrong. If it is not returning, it should at least run out of memory (a stack overflow), right? How can I rectify this?
EDIT 1
It works if I change the parentheses in the implementation of MyNonEmptySet:
(left union (right union that)) incl elem
I don't understand why. Both should give the same result, right? Why does one version take forever (but does not run out of memory) while the other works instantly for the same data?
The reason that a binary tree is a good data structure is that it is sorted, so you can do fast searches in log n time.
It looks like you are not using a sorted binary tree.
Your second version works, but all the work is done by
incl elem
and that is rather slow.
The first version has a recursive step that keeps unioning the tree it has just built, so it re-does the same work over and over and takes so long that it appears never to leave the recursion.
There are great tree-set implementations in Scala's standard library; I would just use one of those.
The right way to merge binary search trees efficiently is with red-black trees, but that is non-trivial:
https://www.wikiwand.com/en/Red%E2%80%93black_tree

In TSQL, is there any performance difference between SUM(A + B) vs SUM(A) + SUM(B)?

I have to add 2 fields together and then sum the result across rows. Is there any difference from a performance standpoint between doing the addition of the fields before or after the columns have been summed?
Method 1 = SELECT SUM(columnA + columnB)
Method 2 = SELECT SUM(columnA) + SUM(columnB)
(Environment = SQL Server 2008 R2)
I have checked on this, and what I see is that SUM(x) + SUM(y) is faster.
Why? SUM is an aggregate function, and aggregates skip NULL values. When you combine two fields inside the aggregate, the processor has to check, for each record, whether either field is NULL, since a column can contain both values and NULLs, and adding NULL (or UNKNOWN, if you like) to something still yields NULL. So that check has to happen for every record.
When you look at your execution plan and check the Compute Scalar operator, you'll see exactly this behavior.
For the SUM(x) + SUM(y) method you see an estimated CPU cost of 0.0000001, where the other method takes up to 0.0000041. That is quite a bit more!
Also, when you take a closer look you'll see that SUM(x + y) is rewritten to something like
[Expr1004] = Scalar Operator(CASE WHEN [Expr1006]=(0) THEN NULL ELSE [Expr1007] END)
So, in the end, SUM(x) + SUM(y) can be considered faster.
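One caveat worth adding to the NULL discussion above: the two forms are not interchangeable, because SUM(A + B) skips any row where either column is NULL. A minimal illustration (the table variable is made up for the example):
DECLARE @t TABLE (A int, B int);
INSERT INTO @t VALUES (1, NULL), (2, 3);

SELECT SUM(A + B)      AS row_wise, -- 5: the row (1, NULL) evaluates to NULL and is skipped
       SUM(A) + SUM(B) AS col_wise  -- 6: SUM(A) = 3 plus SUM(B) = 3
FROM @t;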

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, not the usual paging on number of results.
And I'm now faced with the choice of whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or asking us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, which is why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name, kept up to date by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than they are written, and this amortises the cost of the calculation (done only on writes) across all reads.
It introduces redundant data, but that's okay for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
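In SQL Server specifically, a persisted computed column gets the same effect without hand-writing a trigger; a sketch, with the table and index names assumed:
ALTER TABLE candidates
    ADD name_first_char_lower AS LOWER(LEFT(name, 1)) PERSISTED;

CREATE INDEX IX_candidates_first_char ON candidates (name_first_char_lower);

-- the paging filter then becomes a plain index seek:
SELECT * FROM candidates WHERE name_first_char_lower = @firstletter;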
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER, 5) = 'PCNSF'
or LEFT(VOUCHER, 5) = 'PCLTF'
or LEFT(VOUCHER, 5) = 'PCACH'
or LEFT(VOUCHER, 4) = 'PCWP'
or LEFT(VOUCHER, 5) = 'PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the LEFT version. As an aside, my overall query does hit some indexes.
I would always suggest using the LIKE operator when the search column has an index. I tested the above in my production environment with select count(column_name) from table_name where left(column_name, 3) = 'AAA' OR left(column_name, 3) = 'ABA' OR ... up to 9 OR clauses. My count displays 7,301,477 records in 4 seconds with LEFT, versus 1 second with LIKE, i.e. where column_name like 'AAA%' OR column_name like 'ABA%' OR ... up to 9 LIKE clauses.
Calling a function in the WHERE clause is not best practice. See http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.StartsWith(...) and you'll get just a LIKE predicate in the generated SQL instead of all this LEFT craziness!
Depending on your needs, you may need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in (non-Core) Entity Framework's EntityFunctions, so I'm not sure how to do it for EF6.