Filter out odd (out of common order) rows - PostgreSQL

Let's assume I have a table with a counter column where every new row has a counter value equal to or larger than the previous one. For example, a proper table could look like this:
id | counter
-------------
1 | 0
2 | 1
3 | 3
4 | 3
5 | 4
6 | 7
7 | 11
8 | 19
9 | 19
10 | 120
11 | 124
12 | 133
However, let's assume some odd rows are inserted that break the rule, as in this example with 2 NOT good rows:
id | counter
-------------
1 | 0 (first row)
2 | 1 (larger value, good)
3 | 3 (larger value, good)
4 | 3 (same value, good)
5 | 4 (larger value, good)
6 | 21 (larger value, it appears to be good at this point, however the next values prove this actually was an error, *NOT good*, should be removed from the result set)
7 | 11 (smaller value, it appears to be NOT good at this point, however the next values prove this was NOT an error, should NOT be removed from the result set)
8 | 19 (larger value, good)
9 | 19 (same value, good)
10 | 120 (larger value, good)
11 | 64 (smaller value, *NOT good*)
12 | 133 (larger value, good)
Would it be possible to filter/discard those NOT good rows from the result set?
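One way to make the "later values prove it" rule precise is to keep a longest non-decreasing subsequence of counter and discard everything outside it. Below is a minimal sketch of that idea (assuming the table is named mytable); the recursive CTE enumerates every candidate chain, so it is exponential in the worst case and only practical for small tables:
WITH RECURSIVE chains AS (
    -- every row starts a chain of length 1
    SELECT id, counter, 1 AS len, ARRAY[id] AS ids
    FROM mytable
    UNION ALL
    -- extend a chain with any later row whose counter did not decrease
    SELECT m.id, m.counter, c.len + 1, c.ids || m.id
    FROM chains c
    JOIN mytable m ON m.id > c.id AND m.counter >= c.counter
)
SELECT t.id, t.counter
FROM mytable t
JOIN (SELECT ids FROM chains ORDER BY len DESC LIMIT 1) best
  ON t.id = ANY (best.ids)
ORDER BY t.id;
Note that the longest subsequence is not always unique (in the example above, dropping ids 6 and 11 and dropping ids 6 and 10 both leave chains of length 10), so you may need an extra tie-breaking rule; for large tables, computing the subsequence procedurally in O(n log n) would be the better approach.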

Related

pySpark sequential window function / access previously computed value (previous row)

I have a pySpark dataframe and need to compute a column which depends on the value of the same column in the previous row. But instead of using the previous row's old value of this column, I need the new one, to which the calculation has already been applied.
Specifically, I need to compute the following: given a dataframe with a column A, in a specific order, compute a column B sequentially as B = MAX(0, LAG(B) - LAG(A)), starting with a default value of 0 for the first row.
Example:
Input:
order | A
------|----
0 | -1
1 | -2
2 | 4
3 | 4
4 | -1
5 | 4
6 | -1
Wanted output:
order | A | B
------|----|---
0 | -1 | 0 <- B is set to the default 0 here
1 | -2 | 1
2 | 4 | 3
3 | 4 | 0
4 | -1 | 0
5 | 4 | 1
6 | -1 | 0
Using the default F.lag window function does not work, since it only yields the old value of the previous row: a window function cannot reference its own output, as the computation would then have to run sequentially and distributed computing would no longer be possible.
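That said, this particular recurrence has a closed form that plain window functions can compute. Writing S(n) = -(A(0) + ... + A(n-1)) with S(0) = 0, unrolling B(n) = MAX(0, B(n-1) - A(n-1)) gives B(n) = S(n) - MIN(S(0), ..., S(n)). A sketch under that assumption (it uses a single global ordering, which limits parallelism):
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, -1), (1, -2), (2, 4), (3, 4), (4, -1), (5, 4), (6, -1)],
    ["order", "A"],
)

# S(n): negated running sum of A over all *previous* rows (0 for the first row)
w_prev = Window.orderBy("order").rowsBetween(Window.unboundedPreceding, -1)
# running minimum of S up to and including the current row
w_upto = Window.orderBy("order").rowsBetween(Window.unboundedPreceding, 0)

result = (
    df.withColumn("S", F.coalesce(F.sum(-F.col("A")).over(w_prev), F.lit(0)))
      .withColumn("B", F.col("S") - F.min("S").over(w_upto))
      .drop("S")
)
result.orderBy("order").show()
On the example data this reproduces the wanted output (B = 0, 1, 3, 0, 0, 1, 0).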

A Postgres query to get subtraction of a value in a row by the value in the next row

I have a table (mytable) like:
id | value
=========
1 | 4
2 | 5
3 | 8
4 | 16
5 | 8
...
I need a query that gives me, for each row, its value minus the previous row's value:
id | value | diff
=================
1 | 4 | 4 (4-Null)
2 | 5 | 1 (5-4)
3 | 8 | 3 (8-5)
4 | 16 | 8 (16-8)
5 | 8 | -8 (8-16)
...
Right now I use a Python script to do so, but I guess it would be faster if I created a view on this table.
You should use window functions - LAG() in this case:
SELECT id, value, value - LAG(value, 1) OVER (ORDER BY id) AS diff
FROM mytable
ORDER BY id;
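Note that with the two-argument form, diff comes out NULL for the first row rather than the 4 shown in the expected output, since there is no previous row. If you want 4 there, LAG's optional third argument supplies a default:
SELECT id, value, value - LAG(value, 1, 0) OVER (ORDER BY id) AS diff
FROM mytable
ORDER BY id;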

kdb+/q: Fast vector update given a list of keys and values to be updated

Given a list of ids/keys and a set of corresponding values for a constant column:
q)ikeys: 1 2 3 5;
q)ivals: 100 100.5 101.5 99.5;
What is the fastest way to update the `toupd column in the following table so that the rows matching the given ikeys are updated to the new values in ivals, i.e.
q) show tab;
ikeys | `toupd `noupd
------|--------------
1 | 0.5 1
2 | 100.5 2
3 | 500.5 4
4 | 400.5 8
5 | 400.5 16
6 | 600.5 32
7 | 700.5 64
is updated to:
q) show restab;
ikeys | `toupd `noupd
------|--------------
1 | 100 1
2 | 100.5 2
3 | 101.5 4
4 | 400.5 8
5 | 99.5 16
6 | 600.5 32
7 | 700.5 64
Furthermore, is there a canonical method with which one could update multiple columns in this manner?
Thanks
A dot amend is another approach, and one which generalises more easily to more than one column. It can also take advantage of amend-in-place, which is the most memory-efficient approach as it doesn't create a duplicate copy of the table in memory (this assumes tab is a global):
ikeys:1 2 3 5
ivals:100 100.5 101.5 99.5
tab:([ikeys:1+til 7]toupd:.5 100.5 500.5 400.5 400.5 600.5 700.5;noupd:1 2 4 8 16 32 64)
q).[tab;(([]ikeys);`toupd);:;ivals]
ikeys| toupd noupd
-----| -----------
1 | 100 1
2 | 100.5 2
3 | 101.5 4
4 | 400.5 8
5 | 99.5 16
6 | 600.5 32
7 | 700.5 64
/amend in place
.[`tab;(([]ikeys);`toupd);:;ivals]
/generalise to two columns
q).[tab;(([]ikeys);`toupd`noupd);:;flip(ivals;1000 2000 3000 4000)]
ikeys| toupd noupd
-----| -----------
1 | 100 1000
2 | 100.5 2000
3 | 101.5 3000
4 | 400.5 8
5 | 99.5 4000
6 | 600.5 32
7 | 700.5 64
/you could amend in place here too
.[`tab;(([]ikeys);`toupd`noupd);:;flip(ivals;1000 2000 3000 4000)]
Here are two different ways of doing it.
tab lj ([ikeys] toupd: ivals)
or
m: ikeys
update toupd: ivals from tab where ikeys in m
I'm sure there are plenty more ways. If you want to find out which is fastest for your purpose (and your data), try using q)\t:1000 yourCodeHere for large tables and see which suits you best.
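For instance, a minimal timing comparison of the two approaches might look like this (using the definitions above; absolute numbers will depend entirely on your data):
q)\t:1000 tab lj ([ikeys] toupd: ivals)
q)\t:1000 update toupd: ivals from tab where ikeys in m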
As for which is the canonical way for multiple columns, I imagine it would be the update, but it's a matter of personal preference, just do whatever is fastest.
A dictionary is also a common method of updating values given a mapping. Indexing the dictionary with the ikeys column gives the new values and then we fill in nulls with the old toupd column values.
q)show d:ikeys!ivals
1| 100
2| 100.5
3| 101.5
5| 99.5
q)update toupd:toupd^d ikeys from tab
ikeys| toupd noupd
-----| -----------
1 | 100 1
2 | 100.5 2
3 | 101.5 4
4 | 400.5 8
5 | 99.5 16
6 | 600.5 32
7 | 700.5 64
It is also worth noting that the update with the where clause is not guaranteed to work in all cases, e.g. if you have more mapping values than matching rows in your ikeys column:
q)m:ikeys:1 2 3 5 7 11
q)ivals:100 100.5 101.5 99.5 100 100
q)update toupd: ivals from tab where ikeys in m
'length

de-aggregate for table columns in Greenplum

I am using Greenplum, and I have data like:
id | val
----+-----
12 | 12
12 | 23
12 | 34
13 | 23
13 | 34
13 | 45
(6 rows)
somehow I want the result like:
id | step
----+-----
12 | 12
12 | 11
12 | 11
13 | 23
13 | 11
13 | 11
(6 rows)
How it should work:
First there should be a window function, which executes a de-aggregate function partitioned by id.
The column val holds cumulative values, and what I want to get is the step values.
Maybe I can do it like:
select deagg(val) over (partition by id) from table_name;
So I need the deagg function.
Thanks for your help!
P.S. Greenplum is based on PostgreSQL v8.2.
You can just use the LAG function:
SELECT id,
       val - LAG(val, 1, 0) OVER (PARTITION BY id ORDER BY val) AS step
FROM yourTable
Note carefully that lag() takes three parameters here. The first is the column to lag, the second says to look one record back, and the third is the default value lag returns (instead of NULL) when there is no previous record.
Here is a table showing the intermediate values this query would generate:
id | val | lag(val, 1, 0) | val - lag(val, 1, 0)
----+-----+----------------+----------------------
12 | 12 | 0 | 12
12 | 23 | 12 | 11
12 | 34 | 23 | 11
13 | 23 | 0 | 23
13 | 34 | 23 | 11
13 | 45 | 34 | 11
Second note: this answer assumes that you want to compute your rolling difference in ascending order of val. If you want a different order, you can change the ORDER BY clause of the window.
val seems to be a cumulative sum. You can "unaggregate" it by subtracting the previous val from the current val, e.g. by using the lag function. Just note that you'll have to treat the first value in each group specially, as lag will return NULL there:
SELECT id, val - COALESCE(LAG(val) OVER (PARTITION BY id ORDER BY val), 0) AS val
FROM mytable;

postgresql Subselect Aggregate in larger query

I'm working with a gigantic dataset of individuals with demographic information and action tracking. Getting the percentage of people who committed an action is simple, but I am also trying to get the average age of the people who fit a specific subgroup of the original SELECT. The CASE WHEN line works fine alone, and the subquery runs fine in its own query, but I cannot seem to integrate it into this query as a subquery; it gives me a syntax error on the CASE WHEN statement. Here's a slightly anonymized version of the query. Any help would be VERY appreciated.
SELECT
AVG(ageagg)
FROM
(
SELECT
age AS ageagg
FROM
agetable
WHERE
age>30
AND action_taken=1) AvgAge_30Action,
COUNT(
CASE
WHEN action_taken=1
AND age> 30
THEN 1
ELSE 0 NULL) / COUNT(
CASE
WHEN age>30) AS Over_30_Action
FROM
agetable
WHERE
website_type=3
If I've interpreted your intent correctly, you wish to compute the following:
1) the number of people over the age of 30 that took a specific action as a percentage of the total number of people over the age of 30
2) the average age of the people over the age of 30 that took a specific action
Assuming my interpretation is correct, this query might work for you:
SELECT
100 * over_30_action / over_30_total AS percentage_of_over_30_took_action,
average_age_of_over_30_took_action
FROM (
SELECT
SUM(CASE WHEN action_taken=1 THEN 1 ELSE 0 END) AS over_30_action,
COUNT(*) AS over_30_total,
AVG(CASE WHEN action_taken=1 THEN age ELSE NULL END)
AS average_age_of_over_30_took_action
FROM agetable
WHERE website_type=3 AND age>30
) aggregated;
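One caveat on the query above: if over_30_action and over_30_total are both integer columns, Postgres performs integer division, truncating any fractional percentage (it happens to be exact here, since 100 * 3 / 10 = 30). Casting one operand to numeric avoids that:
100.0 * over_30_action / over_30_total AS percentage_of_over_30_took_action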
I created a dummy table and populated it with the following data.
postgres=# select * from agetable order by website_type, action_taken, age;
age | action_taken | website_type
-----+--------------+--------------
33 | 1 | 1
32 | 1 | 2
28 | 1 | 3
29 | 1 | 3
32 | 1 | 3
33 | 1 | 3
34 | 1 | 3
32 | 2 | 3
32 | 3 | 3
33 | 4 | 3
34 | 5 | 3
33 | 6 | 3
34 | 7 | 3
35 | 8 | 3
(14 rows)
Of the 14 rows, 4 rows (the first four in this listing) have either the wrong website_type or have age below 30. Of the ten remaining rows, you can see that 3 of them have an action_taken of 1. So, the query should determine that 30% of folks over the age of 30 took a particular action, and the average age among that particular population should be 33 (ages 32, 33, and 34). The results of the query I posted:
percentage_of_over_30_took_action | average_age_of_over_30_took_action
-----------------------------------+------------------------------------
30 | 33.0000000000000000
(1 row)
Again, all of this is predicated upon my interpretation of your intent actually being accurate. This is of course based on a highly contrived data set, but hopefully it's enough of a functional signpost to get you on the right path.