kdb: Where phrase AND performance

The KX reference says the first of these should be faster than the second:
q)select from t where c2>15,c3<3.0
q)select from t where (c2>15) and c3<3.0
In the first example, only c3 values corresponding to c2 values
greater than 15 are tested.
In the second example, all c2 values are compared to 15, and all c3
values are compared to 3.0. The two result vectors are ANDed together.
---
However, I noticed the contrary is true in my example below. Am I missing something?
N:100000000;
t:([]a:N?1000;b:N?1000;c:N?1000)
\t select from t where a>500,b>500; / ~500ms
\t select from t where (a>500) and b>500; / ~390ms

It's a general statement - when they say that it should be faster, they mean in most practical situations. There will always be edge cases where certain approaches aren't faster, depending on the shape of the data.
In your case, since the first filter isn't reducing the dataset by a huge amount (it only halves it), the overhead of "halving" the second column before applying the second filter just happens to be greater than outright applying the second filter to the whole column (kdb is very fast at vector operations, after all).
If, for example, your first filter reduced the dataset by a lot, then you would see more of a speed gain, as the saving from applying the second filter to a much smaller set dominates:
q)\t select from t where a<10,b>500;
263
q)\t select from t where (a<10) and b>500;
450
q)\t select from t where a=950,b>500;
208
q)\t select from t where (a=950) and b>500;
422
The speed improvement would be even more pronounced if the first column had attributes applied, say `g# (in-memory) or `p# (on disk). And since in most high-volume production scenarios there would be attributes speeding up the first filter, they make the statement that it should be faster (almost implying that if it isn't then you probably aren't making use of attributes!).
Here's an extreme example where the a column has the sorted attribute applied:
q)`a xasc `t;
q)\t select from t where a=950,b>500;
1
q)\t select from t where (a=950) and b>500;
428

Related

How does an index's fill factor relate to a query plan?

When a PostgreSQL query's execution plan is generated, how does an index's fill factor affect whether the index gets used in favor of a sequential scan?
A fellow dev and I were reviewing the performance of a PostgreSQL (12.4) query with a windowed function of row_number() OVER (PARTITION BY x, y, z) and seeing if we could speed it up with an index on said fields. We found that during the course of the query the index would get used if we created it with a fill factor >= 80 but not at 75. This was a surprise to us as we did not expect the fill factor to be considered in creating the query plan.
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used. What causes this behavior and should we consider it when selecting an index's fill factor on a table that will have frequent inserts and deletes and be periodically vacuumed?
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used.
So, it is not the fill factor, but rather the size of the index (which is influenced by the fill factor). This agrees with my memory that index size is a (fairly weak) influence on the cost estimate. That influence is almost zero if you are reading only one tuple, but larger if you are reading many tuples.
If the cost estimates of the plan are close to each other, then small differences such as this will be enough to drive one over the other. But that doesn't mean you should worry about them. If one plan is clearly superior to the other, then you should think about why the estimates are so close together to start with when the realities are not close together.
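To make the "close estimates" point concrete, here's a toy comparison in Python with invented numbers and an invented formula (this is not PostgreSQL's actual cost model; the constants are purely illustrative):
seq_scan_cost = 12000.0  # hypothetical estimate for the sequential-scan plan

def index_scan_cost(index_pages, tuples_fetched, per_page=0.01, per_tuple=0.05):
    # hypothetical: a small size-dependent term plus a per-tuple term
    return index_pages * per_page + tuples_fetched * per_tuple

cost_ff80 = index_scan_cost(index_pages=9500, tuples_fetched=238000)   # denser index
cost_ff75 = index_scan_cost(index_pages=10200, tuples_fetched=238000)  # more pages
print(cost_ff80 < seq_scan_cost, cost_ff75 < seq_scan_cost)  # prints: True False (the choice flips)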

Randomly sampling n rows in impala using random() or tablesample system()

I would like to randomly sample n rows from a table using Impala. I can think of two ways to do this, namely:
SELECT * FROM TABLE ORDER BY RANDOM() LIMIT <n>
or
SELECT * FROM TABLE TABLESAMPLE SYSTEM(1) limit <n>
In my case I set n to 10000 and sample from a table of over 20 million rows. If I understand correctly, the first option essentially creates a random number between 0 and 1 for each row and orders by this random number.
The second option creates many different 'buckets' and then randomly samples at least 1% of the data (in practice this always seems to be much greater than the percentage provided). In both cases I then select only the 10000 first rows.
Is the first option reliable to randomly sample the 10K rows in my case?
Edit: some additional context. The structure of the data is why the random sampling or shuffling of the entire table seems quite important to me. Additional rows are added to the table daily. For example, one of the columns is country and usually the incoming rows are then first all from country A, then from country B, etc. For this reason I am worried that the second option would maybe sample too many rows from a single country, rather than randomly. Is that a justified concern?
Related thread that reveals the second option: What is the best query to sample from Impala for a huge database?
I beg to differ, OP. I prefer the second option.
With the first option, you are assigning a value between 0 and 1 to every row and then picking up the first 10000 records, so Impala has to process all rows in the table and the operation will be slow on a 20-million-row table.
With the second option, Impala randomly picks up rows from files based on the percentage you provide. Since this works on the files, the returned row count may differ from the percentage you specify. This method is also what Impala uses to compute statistics. So performance-wise it is much better, but the uniformity of the randomness can be a problem.
Final thought -
If you are worried about the randomness and correctness of your sample, go for option 1. But if you are not that worried about randomness and just want sample data quickly, pick the second option. Since Impala uses it for COMPUTE STATS, I'd pick that one :)
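To illustrate why the randomness can be a problem when the data is clustered by country (as described in the question), here's a small Python simulation of row-level vs block-level sampling (purely illustrative; it doesn't use Impala, and the row counts and block size are made up):
import random
from collections import Counter

# Rows arrive clustered by country, as described in the question.
rows = [("A", i) for i in range(10000)] + [("B", i) for i in range(10000)]

# Row-level sampling (roughly what ORDER BY RANDOM() ... LIMIT does): unbiased.
row_sample = random.sample(rows, 1000)

# Block-level sampling (roughly what TABLESAMPLE SYSTEM does): whole blocks of
# consecutive rows are picked, so a clustered country can dominate the sample.
block_size = 1000
blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
block_sample = [r for blk in random.sample(blocks, 1) for r in blk]

print(Counter(c for c, _ in row_sample))    # close to 50/50 between A and B
print(Counter(c for c, _ in block_sample))  # all from one country here, since every block is homogeneous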
EDIT: After looking at your requirement, here is a method to sample over a particular field or fields.
We will use a window function to assign a random row number within each country group, then pick up 1% (or whatever % you want) from that data set.
This will make sure the data is evenly distributed between countries, and each country contributes the same % of its rows to the result set.
select * from
(
select
  row_number() over (partition by country order by country, random()) rn,
  count(*) over (partition by country) cntpartition,
  tab.*
from dat.mytable tab
) rs
where rs.rn between 1 and rs.cntpartition * 1 / 100 -- this is for 1% of the data
HTH

Does POSTGRES's query optimizer's statistical estimator compute a most common value list for an intermediate product of a multi-table join?

I am reading through Postgres' query optimizer's statistical estimator's code to understand how it works.
For reference, Postgres' query optimizer's statistical estimator estimates the size of the output of an operation (e.g. join, select) in a Postgres plan tree. This allows Postgres to choose between the different ways a query can be executed.
Postgres' statistical estimator uses cached statistics about the contents of each of a relation's columns to help estimate output sizes. The two key saved data structures seem to be:
A most common value (MCV) list: a list of the most common values stored in that column and the frequency with which each of them appears.
A histogram of the data stored in that column.
For example, given the table:
X Y
1 A
1 B
1 C
2 A
2 D
3 B
The most common values list for X would contain {1: 0.5, 2: 0.333}.
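As a sanity check, those frequencies are just value counts divided by the number of rows; a quick Python sketch over the example column (nothing Postgres-specific):
from collections import Counter

x = [1, 1, 1, 2, 2, 3]  # the X column from the example table
counts = Counter(x)
mcv = {value: round(n / len(x), 3) for value, n in counts.most_common(2)}
print(mcv)  # {1: 0.5, 2: 0.333}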
However, when Postgres completes the first join in a multi join operation like in the example below:
SELECT *
FROM A, B, C, D
WHERE A.ID = B.ID AND B.ID2 = C.ID2 AND C.ID3 = D.ID3
the resulting table does not have an MCV (or histogram), since we've just created it and haven't ANALYZEd it. This will make it harder to estimate the output size/cost of the remaining joins.
Does Postgres automatically generate/estimate the MCV (and histogram) for this table to help statistical estimation? If it does, how does it create this MCV?
For reference, here's what I've looked at so far:
The documentation giving a high-level overview of how the Postgres statistical planner works:
https://www.postgresql.org/docs/12/planner-stats-details.html
The code which carries out the majority of POSTGRES's statistical estimation:
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/selfuncs.c
The code which generates a relation's MCV:
https://github.com/postgres/postgres/blob/master/src/backend/statistics/mcv.c
Generic logic for clause selectivities:
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/clausesel.c
A pointer to the right code file to look at would be much appreciated! Many thanks for your time. :)
The result of a join is called a join relation in PostgreSQL jargon, but that does not mean that it is a “materialized” table that is somehow comparable to a regular PostgreSQL table (which is called a base relation).
In particular, since the join relation does not physically exist, it cannot be ANALYZEd to collect statistics. Rather, the row count is estimated based on the size of the joined relations and the selectivity of the join conditions. This selectivity is a number between 0 (the condition excludes all rows) and 1 (the condition does not filter out anything).
The relevant code is in calc_joinrel_size_estimate in src/backend/optimizer/path/costsize.c, which you are invited to study.
The key points are:
Join conditions that correspond to foreign keys are considered specially:
If all columns in a foreign key are join conditions, then we know that the result of such a join must be as big as the referencing table, so the selectivity is 1 / referenced table size.
Other join conditions are estimated separately by guessing what percentage of rows will be eliminated by that condition.
In the case of a left (or right) outer join, we know that the result size must be at least as big as the left (or right) side.
Finally, the size of the cartesian join (the product of the relation sizes) is multiplied by all the selectivities calculated above.
Note that this treats all conditions as independent, which causes bad estimates if the conditions are correlated. But since PostgreSQL doesn't have cross-table statistics, it cannot do better.
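A back-of-the-envelope version of that calculation, in Python (the table sizes and the non-FK selectivity are made up; the real logic in calc_joinrel_size_estimate is considerably more involved):
rows_a, rows_b = 1000000, 50000  # hypothetical sizes of the joined relations

# FK join a.id -> b.id: selectivity is 1 / size of the referenced table.
fk_selectivity = 1.0 / rows_b

# A second, non-FK condition, with a guessed fraction of surviving rows.
extra_selectivity = 0.1

# Cartesian size multiplied by all selectivities, assuming independence.
estimated_rows = rows_a * rows_b * fk_selectivity * extra_selectivity
print(round(estimated_rows))  # 100000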

quick sort is slower than merge sort

I think quick sort is less efficient when sorting an array that contains duplicate data, right? When the datatype is char, the bigger the array (over 100000), the closer it gets to O(n^2).
And assuming there is no duplicate data: to get the best case of a quick sort where the first element is chosen as the pivot, I think we can recursively swap the first and middle elements while dividing the already-sorted array, like a merge sort. Right? Is there a general best case?
Lomuto partition scheme, which scans from one end to the other during partition, is slower with duplicates. If all the values are the same, then each partition step splits it into sizes 1 and n-1, a worst case scenario.
Hoare partition scheme, which scans from both ends towards each other until the indexes (or iterators or pointers) cross, is usually faster with duplicates. Even though duplicates result in more swaps, each swap occurs just after reading and comparing two values to the pivot, so they are still in the cache for the swap (assuming the object size is not huge). As the number of duplicates increases, the splitting improves towards the ideal case where each partition step splits the data into two equal halves. I ran a benchmark sorting 16 million 64 bit integers: with random data, it took about 1.37 seconds, improving with duplicates, and with all values the same, it took about 0.288 seconds.
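For reference, here's a sketch of the Hoare scheme in Python (following the Wikipedia pseudocode; the quicksort driver and the sample data are just there to make it runnable):
def hoare_partition(a, lo, hi):
    # Scan from both ends towards each other until the indexes cross.
    pivot = a[(lo + hi) // 2]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j
        a[i], a[j] = a[j], a[i]  # both values were just compared, so they are still in cache

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = hoare_partition(a, lo, hi)
        quicksort(a, lo, p)      # note: p is included on the left side
        quicksort(a, p + 1, hi)

data = [3, 3, 1, 3, 2, 3, 3]
quicksort(data)
print(data)  # [1, 2, 3, 3, 3, 3, 3]; duplicates push the split towards the middle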
Another alternative is a 3-way partition, which splits a partition into elements < pivot, elements == pivot, and elements > pivot. If all the elements are the same, it's done in O(n) time. For n elements with only k distinct values, the time complexity is O(n ⌈log3(k)⌉), and since k is constant, the time complexity is still O(n). A sketch is shown after the wiki links below.
Wiki links:
https://en.wikipedia.org/wiki/Quicksort#Repeated_elements
https://en.wikipedia.org/wiki/Dutch_national_flag_problem
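And a sketch of the 3-way (Dutch national flag) partition described above, again in Python:
def partition3(a, lo, hi):
    # Partition a[lo..hi] into < pivot, == pivot, > pivot; returns (lt, gt)
    # such that a[lt..gt] holds all elements equal to the pivot.
    pivot = a[lo]
    lt, i, gt = lo, lo, hi
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    return lt, gt

def quicksort3(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    lt, gt = partition3(a, lo, hi)
    quicksort3(a, lo, lt - 1)   # the run equal to the pivot is never recursed into
    quicksort3(a, gt + 1, hi)

data = [5, 1, 5, 5, 3, 5, 2, 5]
quicksort3(data)
print(data)  # [1, 2, 3, 5, 5, 5, 5, 5]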

What element of the array would be the median if the size of the array was even and not odd?

I read that it's possible to make quicksort run in O(n log n).
The algorithm says: on each step, choose the median as the pivot.
but, suppose we have this array:
10 8 39 2 9 20
which value will be the median?
In math, if I remember correctly, the median is (39+2)/2 = 41/2 = 20.5.
I don't have a 20.5 in my array though
thanks in advance
You can choose either of them; asymptotically, as the input scales up, it does not matter which one you pick.
We're talking about the exact wording of the description of an algorithm here, and I don't have the text you're referring to. But I think in context, by "median" they probably meant not the mathematical median of the values in the list, but rather the middle point of the list, i.e. the median INDEX, which in this case would be 3 or 4. As coffNjava says, you can take either one.
The median is actually found by sorting the array first, so in your example you would arrange the numbers as 2 8 9 10 20 39, and the median would be the mean of the two middle elements, (9+10)/2 = 9.5, which doesn't help you at all. Using the median is sort of an ideal situation, but would work if the array were at least already partially sorted, I think.
With an even-numbered array, you can't find an exact pivot point, so I believe you can use either of the middle numbers. It'll throw off the efficiency a bit, but not substantially unless you always end up sorting even arrays.
Finding the median of an unsorted set of numbers can be done in O(N) time, but it's not really necessary to find the true median for the purposes of quicksort's pivot. You just need to find a pivot that's reasonable.
As the Wikipedia entry for quicksort says:
In very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by R. Sedgewick).
Finding the median of three values is much easier than finding it for the whole collection of values, and for collections that have an even number of elements, it doesn't really matter which of the two 'middle' elements you choose as the potential pivot.
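A minimal Python sketch of the median-of-three choice, using the array from the question (just an illustration of the pivot selection, not a full quicksort):
def median_of_three(a, lo, hi):
    # Return the index of whichever of a[lo], a[mid], a[hi] has the middle value.
    mid = (lo + hi) // 2
    candidates = sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])
    return candidates[1][1]

a = [10, 8, 39, 2, 9, 20]
p = median_of_three(a, 0, len(a) - 1)
print(p, a[p])  # 5 20: the median of 10, 39 and 20, a perfectly reasonable pivot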