Order preserved in window function without ORDER BY - postgresql

Is unordered_window guaranteed to provide elements in same order when used repeatedly in select? I was hoping to avoid the cost of ordering since the order is irrelevant as long as it's the same.
I.e, will xs[i] and ys[i] always be elements from same row in xyz?
select
array_agg(x) over unordered_window as xs,
array_agg(y) over unordered_window as ys
from
xyz
window
unordered_window as (partition by z);

Use EXPLAIN (VERBOSE, COSTS OFF) to see what happens:
QUERY PLAN
═══════════════════════════════════════════════════════════
WindowAgg
Output: array_agg(x) OVER (?), array_agg(y) OVER (?), z
-> Sort
Output: z, x, y
Sort Key: xyz.z
-> Seq Scan on laurenz.xyz
Output: z, x, y
There is only a single sort, so we can deduce that the order will be the same.
But that is not guaranteed, to it is possible (albeit unlikely) that the implementation may change.
But you see that a sort is performed anyway. You may as well add the ORDER BY; all that will do is another sort key, which won't slow down the execution much. So you might just as well add the ORDER BY and be safe.

Related

Any subset of totally ordered set X is totally ordered for the restriction of the order on X

I do understand every subset of a totally ordered set must be a total order as each a, b in the totally ordered set follows either aRb or bRa.
I don’t understand what the phrase “for the restriction of the order on X” means? Can anyone explain.
An order in a set X is a relation between elements of X, this is, a subset R of the Cartesian product X × X that satisfies the axioms of order (or total order, etc.).
If Y is a subset of X, the restriction to Y of the order on X is the intersection R ∩ Y×Y of the relation R with the subset Y×Y ⊆ X×X.
In other words, is the same order, but restricted to the subset Y.

How does aggregate generalise fold and fold generalise reduce?

As far as I understand aggregate is a generalisation of fold which in turn is a generalisation of reduce.
Similarily combineByKey is a generalisation of aggregateByKey which in turn is a generalisation of foldByKey which in turn is a generalisation of reduceByKey.
However I have trouble finding simple examples for each of those seven methods which can in turn only be expressed by them and not their less general versions. For example I found http://blog.madhukaraphatak.com/spark-rdd-fold/ giving an example for fold, but I have been able to use reduce in the same situation as well.
What I found out so far:
I read that the more generalised methods can be more efficient, but that would be a non-functional requirement and I would like to get examples which can not be implemented with the more specific method.
I also read that e.g. the function passed to fold only has to be associative, while the one for reduce has to be commutative additionally: https://stackoverflow.com/a/25158790/4533188 (However, I still don't know any good simple example.) whereas in https://stackoverflow.com/a/26635928/4533188 I read that fold needs both properties to hold...
We could think of the zero value as a feature (e.g. for fold over reduce) as in "add all elements and add 3" and using 3 as the zero value, but that would be misleading, because 3 would be added for each partition, not just once. Also this is simply not the purpose of fold as far as I understood - it wasn't meant as a feature, but as a necessity to implement it to be able to take non-commutative functions.
What would simple examples for those seven methods be?
Let's work through what is actually needed logically.
First, note that if your collection is unordered, any set of (binary) operations on it need to be both commutative and associative, or you'll get different answers depending on which (arbitrary) order you choose each time. Since reduce, fold, and aggregate all use binary operations, if you use these things on a collection that is unordered (or is viewed as unordered), everything must be commutative and associative.
reduce is an implementation of the idea that if you can take two things and turn them into one thing, you can collapse an arbitrarily long collection into a single element. Associativity is exactly the property that it doesn't matter how you pair things up as long as you eventually pair them all and keep the left-to-right order unchanged, so that's exactly what you need.
a b c d a b c d a b c d
a # b c d a # b c d a b # c d
(a#b) c # d (a#b) # c d a (b#c) d
(a#b) # (c#d) ((a#b)#c) # d a # ((b#c)#d)
All of the above are the same as long as the operation (here called #) is associative. There is no reason to swap around which things go on the left and which go on the right, so the operation does not need to be commutative (addition is: a+b == b+a; concat is not: ab != ba).
reduce is mathematically simple and requires only an associative operation
Reduce is limited, though, in that it doesn't work on empty collections, and in that you can't change the type. If you're working sequentially, you can a function that takes a new type and the old type, and produces something with the new type. This is a sequential fold (left-fold if the new type goes on the left, right-fold if it goes on the right). There is no choice about the order of operations here, so commutativity and associativity and everything are irrelevant. There's exactly one way to work through your list sequentially. (If you want your left-fold and right-fold to always be the same, then the operation must be associative and commutative, but since left- and right-folds don't generally get accidentally swapped, this isn't very important to ensure.)
The problem comes when you want to work in parallel. You can't sequentially go through your collection; that's not parallel by definition! So you have to insert the new type at multiple places! Let's call our fold operation #, and we'll say that the new type goes on the left. Furthermore, we'll say that we always start with the same element, Z. Now we could do any of the following (and more):
a b c d a b c d a b c d
Z#a b c d Z#a b Z#c d Z#a Z#b Z#c Z#d
(Z#a) # b c d (Z#a) # b (Z#c) # d
((Z#a)#b) # c d
(((Z#a)#b)#c) # d
Now we have a collection of one or more things of the new type. (If the original collection was empty, we just take Z.) We know what to do with that! Reduce! So we make a reduce operation for our new type (let's call it $, and remember it has to be associative), and then we have aggregate:
a b c d a b c d a b c d
Z#a b c d Z#a b Z#c d Z#a Z#b Z#c Z#d
(Z#a) # b c d (Z#a) # b (Z#c) # d Z#a $ Z#b Z#c $ Z#d
((Z#a)#b) # c d ((Z#a)#b) $ ((Z#c)#d) ((Z#a)$(Z#b)) $ ((Z#c)$(Z#d))
(((Z#a)#b)#c) # d
Now, these things all look really different. How can we make sure that they end up to be the same? There is no single concept that describes this, but the Z# operation has to be zero-like and $ and # have to be homomorphic, in that we need (Z#a)#b == (Z#a)$(Z#b). That's the actual relationship that you need (and it is technically very similar to a semigroup homomorphism). There are all sorts of ways to pick badly even if everything is associative and commutative. For example, if Z is the double value 0.0 and # is actually +, then Z is zero-like and # is associative and commutative. But if $ is actually *, which is also associative and commutative, everything goes wrong:
(0.0+2) * (0.0+3) == 2.0 * 3.0 == 6.0
((0.0+2) + 3) == 2.0 + 3 == 5.0
One example of a non-trival aggregate is building a collection, where # is the "append an element" operator and $ is the "concat two collections" operation.
aggregate is tricky and requires an associative reduce operation, plus a zero-like value and a fold-like operation that is homomorphic to the reduce
The bottom line is that aggregate is not simply a generalization of reduce.
But there is a simplification (less general form) if you're not actually changing the type. If Z is actually z and is an actual zero, we can just stick it in wherever we want and use reduce. Again, we don't need commutativity conceptually; we just stick in one or more z's and reduce, and our # and $ operations can be the same thing, namely the original # we used on the reduce
a b c d () <- empty
z#a z#b z
z#a (z#b)#c
z#a ((z#b)#c)#d
(z#a)#((z#b)#c)#d
If we just delete the z's from here, it works perfectly well, and in fact is equivalent to if (empty) z else reduce. But there's another way it could work too. If the operation # is also commutative, and z is not actually a zero but just occupies a fixed point of # (meaning z#z == z but z#a is not necessarily just a), then you can run the same thing, and since commutivity lets you switch the order around, you conceptually can reorder all the z's together at the beginning, and then merge them all together.
And this is a parallel fold, which is really a rather different beast than a sequential fold.
(Note that neither fold nor aggregate are strictly generalizations of reduce even for unordered collections where operations have to be associative and commutative, as some operations do not have a sensible zero! For instance, reducing strings by shortest length has as its "zero" the longest possible string, which conceptually doesn't exist, and practically is an absurd waste of memory.)
fold requires an associative reduce operation plus either a zero value or a reduce operation that's commutative plus a fixed-point value
Now, when would you ever use a parallel fold that wasn't just a reduceOrElse(zero)? Probably never, actually, though they can exist. For example, if you have a ring, you often have fixed points of the type we need. For instance, 10 % 45 == (10*10) % 45, and * is both associative and commutative in integers mod 45. Thus, if our collection is numbers mod 45, we can fold with a "zero" of 10 and an operation of *, and parallelize however we please while still getting the same result. Pretty weird.
However, note that you can just plug the zero and operation of fold into aggregate and get exactly the same result, so aggregate is a proper generalization of fold.
So, bottom line:
Reduce requires only an associative merge operation, but doesn't change the type, and doesn't work on empty collecitons.
Parallel fold tries to extend reduce but requires a true zero, or a fixed point and the merge operation must be commutative.
Aggregate changes the type by (conceptually) running sequential folds followed by a (parallel) reduce, but there are complex relationships between the reduce operation and the fold operation--basically they have to be doing "the same thing".
An unordered collection (e.g. a set) always requires an associative and commutative operation for any of the above.
With regard to the byKey stuff: it's just the same as this, except it only applies it to the collection of values associated with a (potentially repeated) key.
If Spark actually requires commutativity where the above analysis does not suggest it's needed, one could reasonably consider that a bug (or at least an unnecessary limitation of the implementation, given that operations like map and filter preserve order on ordered RDDs).
the function passed to fold only has to be associative, while the one for reduce has to be commutative additionally.
It is not correct. fold on RDDs requires the function to be commutative as well. It is not the same operation as fold on Iterable what is pretty well described in the official documentation:
This behaves somewhat differently from fold operations implemented for non-distributed
collections in functional languages like Scala.
This fold operation may be applied to
partitions individually, and then fold those results into the final result, rather than
apply the fold to each element sequentially in some defined ordering. For functions
that are not commutative, the result may differ from that of a fold applied to a
non-distributed collection.
As you can see order of merging partial values is not part of the contract hence function which is used for fold has to be commutative.
I read that the more generalised methods can be more efficient
Technically speaking there should be no significant difference. For fold vs reduce you can check my answers to reduce() vs. fold() in Apache Spark and Why is the fold action necessary in Spark?
Regarding *byKey methods all are implemented using the same basic construct which is combineByKeyWithClassTag and can be reduced to three simple operations:
createCombiner - create "zero" value for a given partition
mergeValue - merge values into accumulator
mergeCombiners - merge accumulators created for each partition.

how can I calculate order of an element in finite field using ntl?

I'm trying to calculate order of an element in finite field (Group) using ntl. but I did not find any function to do this!
can anyone guide me please?
I think there is no built in way to do this.
But you can write a script yourself.
A field F has two operations, addition (+) and multiplication (*). First you have to specify if you want to know the order of an element g in the group (F,+) or the group (F \ {0}, *).
Find the order of g in (F,+):
This is the easy case, since the order of every element in this group is p, if the field has pm elements.
Find the order of g in (F \ {0}, *):
This is a litte bit hard. The order of g in (F \ {0}, *) is also called the discrete logarithm. Basicly you can try gk for every k=1,...,pm. But this will take a while. A simple way would be the baby-step giant-step algorithm.
I have never tried it, but you may also take a look at this discrete logarithm implementation using NTL.

How to count up a value until all geometry features from one table are selected

For example, I have this query to find the minimum distance between two geometries (stored in 2 tables) with a PostGIS function called ST_Distance.
Having thousands of geometries (in both tables) it takes to much time without using ST_DWithin. ST_DWithin returns true if the geometries are within the specified distance of one another (here 2000m).
SELECT DISTINCT ON
(id)
table1.id,
table2.id
min(ST_Distance(a.geom, b.geom)) AS distance
FROM table1 a, table2 b
WHERE ST_DWithin(a.geom, b.geom, 2000.0)
GROUP BY table1.id, table2.id
ORDER BY table1.id, distance
But you have to estimate the distance value to fetch all geometries (e.g. stored in table1). Therefore you have to look at your data in some way in a GIS, or you have to calculate the maximum distance for all (and that takes a lot of time).
In the moment I do it in that way that I approximate the distance value until all features are queried from table1, for example.
Would it be efficient that my query automatically increases (with a reasonable value) the distance value until the count of all geometries (e.g. for table1) is reached? How can I put this in execution?
Would it be slow down everything because the query needs maybe a lot of approaches to find the distance value?
Do I have to use a recursive query for this purpose?
See this post here: K-Nearest Neighbor Query in PostGIS
Basically, the <-> operator is a bit unusual in that it works in the order by clause, but it avoids having to make a guess as to how far you want to search in ST_DWithin. There is a major gotcha with this operator though, which is that the geometry in the order by clause must be a constant that is you CAN NOT write:
select a.id, b.id from table a, table b order by geom.a <-> geom.b limit 1;
Instead you would have to create a loop, substituting in a value above for geom.b
More information can be found here: http://boundlessgeo.com/2011/09/indexed-nearest-neighbour-search-in-postgis/

How do I add another column to a PostgreSQL subquery?

I wasn't too sure how to phrase this question, so here are the details. I'm using a trick to compute the hamming distance between two bitstrings. Here's the query:
select length(replace(x::text,'0',''))
from (
select code # '000111101101001010' as x
from codeTable
) as foo
Essentially it computes the xor between the two strings, removes all 0's, then returns the length. This is functionally equivalent to the hamming distance between two bitstrings. Unfortunately, this only returns the hamming distance, and nothing else. In the codeTable table, there is also a column called person_id. I want to be able to return the minimum hamming distance and the id associated with that. Returning the minimum hamming distance is simple enough, just add a min() around the 'length' part.
select min(length(replace(x::text,'0','')))
from (
select code # '000111101101001010' as x
from codeTable
) as foo
This is fine, however, it only returns the hamming distance, not the person_id. I have no idea what I would need to do to return the person_id associated with that hamming distance.
Does anybody have any idea on how to do this?
Am I missing something? Why the subquery? Looks to me like the following should work to:
select length(replace((code # '000111101101001010')::text,'0',''))
from codeTable
Going from there I get:
select person_id,length(replace((code # '000111101101001010')::text,'0','')) as x
from codeTable
order by x
limit 1
I replaced the min with an order by and limit 1 because there is no direct way of getting the corresponding person_id for the value returned by the min function. In general postgres will be smart enough not to sort the whole intermediate result but just scan it for the row with the lowest value it has to return.