Cumulative min on earlier versions of PostgreSQL

I am using PostgreSQL 8.2, which is the main reason I'm asking this question. In this version of PostgreSQL I want to get a column (let's name it C) with the cumulative minimum of some other preordered column (let's name it B). So the n-th row of column C should hold the minimum of the values of B in rows 1 to n for some ordering.
In the example below, column A gives the order and column C contains the cumulative minimum of column B in that order:
A B C
------------
1 5 5
2 4 4
3 6 4
4 5 4
5 3 3
6 1 1
Probably the easiest way to explain what I want is the query that does it in later versions:
SELECT A, B, min(B) OVER (ORDER BY A) AS C FROM T;
But version 8.2, of course, doesn't have window functions.
I've written some plpgsql functions that do this on arrays. But to use them I have to use an array_agg aggregate function that I also wrote myself (there is no built-in array_agg in that version). This approach isn't very efficient: it worked well on smaller tables, but it is becoming almost unusable now that I need it on bigger ones.
So I would be very grateful for any suggestions of alternative, more efficient solutions to this problem.
Thank you!

Well, you can use this simple subselect:
SELECT a, b, (SELECT min(b) FROM t t1 WHERE t1.a <= t.a) AS c
FROM t
ORDER BY a;
But I doubt it will be faster for big tables than a plpgsql function. Maybe you can show us your function. There might be room for improvement there.
For this to be fast you should have a multi-column index like:
CREATE INDEX t_a_b_idx ON t (a,b);
But really, you should upgrade to a more recent version of PostgreSQL. Version 8.2 reached end of life last year. No more security updates. And so many missing features ...
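If the subselect turns out to be too slow on the big tables, a single pass in plpgsql that keeps a running minimum avoids both the repeated scans of the subselect and the array juggling. Below is a minimal sketch, assuming a table t(a int, b int) as in the example; the function name cummin_b and the OUT parameter names are made up for illustration:
CREATE OR REPLACE FUNCTION cummin_b(OUT o_a int, OUT o_b int, OUT o_c int)
RETURNS SETOF record AS $$
DECLARE
    rec         record;
    running_min int;
BEGIN
    -- one ordered scan of t; the (a,b) index above can satisfy the ORDER BY
    FOR rec IN SELECT a, b FROM t ORDER BY a LOOP
        IF running_min IS NULL OR rec.b < running_min THEN
            running_min := rec.b;  -- new cumulative minimum
        END IF;
        o_a := rec.a;
        o_b := rec.b;
        o_c := running_min;
        RETURN NEXT;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM cummin_b();
This reads the table once in order instead of re-scanning it for every row, so it scales roughly linearly with the table size.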

Related

How do I replace the first 10 entries in a column with NaN in KDB

I am doing calculations on columns using summation. I want to manually change the first n entries in my calc column from float to NaN. Can someone please advise me how to do that?
For example, if my column in table t now is mycol:(1 2 3 4 5 6 7 8 9), I am trying to get a function that can replace the first n=4 entries with NaN, so my column in table t becomes mycol:(0N 0N 0N 0N 5 6 7 8 9)
Thank you so much!
Emily
We can use amend functionality to replace the first n items with a null value. Additionally, it would be better to use the appropriate null literal for each column based on its type. Something like this would work:
f: {nullDict: "ijfs"!(0Ni;0Nj;0Nf;`); @[x; til y; :; nullDict .Q.ty x]}
This will amend the first y items in the list x. .Q.ty gets the type of the input so that we can look up the corresponding null in the dictionary.
You can then use this for a single column, like so:
update mycol: f[mycol;4] from tbl
You can also do this in one go for multiple columns, with varying numbers of items to be replaced, using functional form:
![tbl;();0b;`mycol`mycol2!((f[;4];`mycol);(f[;3];`mycol2))]
Do take note that you will need to modify nullDict with whatever other types you need.
Update: Thanks to Jonathon McMurray for suggesting a better way to build up nullDict for all primitive types using the below code:
{x!first each x$\:()}.Q.t except " "
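For reference, that expression maps every type character in .Q.t to the matching null (or zero/blank for the types that have no real null), so it can be dropped straight into f in place of the hand-written dictionary. A sketch along those lines, with f redefined to use the global dictionary:
nullDict:{x!first each x$\:()}.Q.t except " "   / type char -> corresponding null (or zero/blank)
f:{@[x; til y; :; nullDict .Q.ty x]}            / same amend as before, now covering all listed types
update mycol: f[mycol;4] from tbl               / usage is unchanged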

Select a table from the inside of external select

I've seen a technique of using an update (mainly for the side effect of adding a new column, I guess) in the form: update someFun each t from t. Is it good or bad practice to use such a technique?
Some experiments:
t1:([]a:1 2);
t2:([]a:1 2;b:30 40);
update s:{(x`a)+x`b} each t2 from t1
It seems we can use different tables to do this, so I guessed we'd have 2x memory over-use.
But:
t:([]a:til 1000000;b:-1*til 1000000);
\ts:10 s0: update s:{(x`a)+x`b} each t from t;
4761 32778560
\ts:10 s1: update s:{(x`a)+x`b} each ([]a;b) from t;
4124 32778976
\ts:10 s2: update s:{x+y}'[a;b] from t;
1908 32778512
gives almost the same result for memory in all cases. I wonder why the memory consumption is the same?
In all examples you're 'eaching' over rows of the table, and it seems the memory consumption is a result of building up the vector incrementally (multiple memory block allocations) rather than in one go. Use vector operations whenever possible:
q)n:5000000;t:([]a:til n;b:-1*til n)
q)
q)// each row
q)\ts update s:{(x`a)+x`b} each t from t;
1709 214218848
q)v:n#0
q)\ts {x}each v
361 214218256
q)
q)// vector op
q)\ts update s:a+b from t;
18 67109760
q)\ts til n
5 67109040
Actually it's already 2x memory used: the size of t is 16 MB (from -22!t) and the memory used is 32 MB.

KDB/Q: What is a vector operation?

I am learning KDB+ and Q programming and read the following statement:
"select performs vector operations on column lists". What does "vector operation" mean here? Could somebody please explain with an example? Also, how is it faster than standard SQL?
A vector operation is an operation that takes one or more vectors and produces another vector. For example + in q is a vector operation:
q)a:1 2 3
q)b:10 20 30
q)a + b
11 22 33
If a and b are columns in a table, you can perform vector operations on them in a select statement. Continuing with the previous example, let's put a and b vectors in a table as columns:
q)([]a;b)
a b
----
1 10
2 20
3 30
Now,
q)select c:a + b from ([]a;b)
c
--
11
22
33
The select statement performed the same a+b vector addition, but took input and returned output as table columns.
How is it faster than standard SQL?
"Standard" SQL implementations typically store data row by row. In a table with many columns the first element of a column and its second element can be separated in memory by the data from other columns. Modern computers operate most efficiently when the data is stored contiguously. In kdb+, this is achieved by storing tables column by column.
A vector is a list of atoms of the same type. Some examples:
2 3 4 5 / int
"A fine, clear day" / char
`ibm`goog`aapl`ibm`msft / symbol
2017.01 2017.02 2017.03m / month
Kdb+ stores and handles vectors very efficiently. Q operators – not just +-*% but e.g. mcount, ratios, prds – are optimised for vectors.
These operators can be even more efficient when vectors have attributes, such as u (no repeated items) and s (items are in ascending order).
When table columns are vectors, those same efficiencies are available. These efficiencies are not available to standard SQL, which views tables as unordered sets of rows.
Being column-oriented, kdb+ can splay large tables, storing each column as a separate file, which reduces file I/O when selecting from large tables.
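As a small illustration of the attribute point above, marking a sorted vector with `s# lets operators such as ? (find) switch to a binary search; the names v and sv below are just for the example:
q)v:til 1000000
q)sv:`s#v          / assert the sorted attribute
q)attr sv
`s
q)sv?999999        / find can binary-search sv thanks to `s#
999999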
The sentence means that when you refer to a specific column of a table by its column label, it resolves to the whole column list rather than to each element of it, and any operations on it are to be understood as list operations.
q)show t: flip `a`b!(til 3;10*til 3)
a b
----
0 0
1 10
2 20
q)select x: count a, y: type b from t
x y
---
3 7
q)type t[`b]
7h
q)type first t[`b]
-7h
count a in the above q-sql is equivalent to count t[`a], which is count 0 1 2 = 3. The same goes for type b; the positive return value 7 means b is a list rather than an atom: http://code.kx.com/q/ref/datatypes/#primitive-datatypes

How do I get the averages of duplicate values in a postgresql column?

I have a postgresql table that looks like this
Division Rate
a 7
b 3
c 4
a 5
b 2
a 1
I want to return a table that looks like this
Division Average
a 4.33
b 2.5
c 4
Is there any way for me to do that? I can't seem to come up with the logic for it.
select Division,avg(Rate) from your_table group by Division;
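If you also want the column named Average, the values rounded and the rows in a stable order as in the desired output, something along these lines should work (your_table is a placeholder, as in the query above):
SELECT Division,
       round(avg(Rate), 2) AS Average  -- cast avg(Rate) to numeric first if Rate is double precision
FROM your_table
GROUP BY Division
ORDER BY Division;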

PostgreSQL - optimising joins on latitudes and longitudes comparing distances

I have two tables, say A and B that contain city information with two columns: latitude and longitude. A contains 100,000 records and B contains 1,000,000 records. My objective is to find the rows of B that are within 1 kilometre from A (for each row in A). How do I go about doing this efficiently? I am targeting a time of less than 30 minutes.
The following query takes forever (which I believe is the result of the cross-product of 100,000 * 1,000,000 = 100 billion row comparisons!):
select *
from A
inner join B
on is_nearby(A.latitude, A.longitude, B.latitude, B.longitude)
is_nearby() is just a simple function that finds the difference between the latitudes and longitudes.
I did a test for one row of A; it takes about 5 seconds per row. By my calculation, it is going to take several weeks for the query to finish execution, which is not acceptable.
Yes, PostGIS will make things faster, since it (a) knows how to convert degrees of latitude and longitude to kilometres (I'll use the geography type below), and (b) supports a GiST index, which is optimal for GIS.
Assuming you have PostGIS version 2 available on your system, upgrade your database and tables:
CREATE EXTENSION postgis;
-- Add a geog column to each of your tables, starting with table A
ALTER TABLE A ADD COLUMN geog geography(Point,4326);
UPDATE A SET geog = ST_MakePoint(longitude, latitude);
CREATE INDEX ON A USING GIST (geog);
-- ... repeat for B, C, etc.
Now to find the rows of B that are within 1 kilometre from A (for each row in A):
SELECT A.*, B.*, ST_Distance(A.geog, B.geog)/1000 AS dist_km
FROM A
JOIN B ON ST_DWithin(A.geog, B.geog, 1000);
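Once both geog columns and their GiST indexes are in place, it is worth confirming that the planner actually uses them; a quick check with the same table and column names as above:
EXPLAIN
SELECT count(*)
FROM A
JOIN B ON ST_DWithin(A.geog, B.geog, 1000);
The plan should show an index scan on one of the geog GiST indexes inside a nested loop; sequential scans on both tables would suggest the indexes are missing or not being picked up.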