Postgres 9.3 count rows matching a column relative to row's timestamp - postgresql

I've used WINDOW functions before but only when working with data that has a fixed cadence/interval. I am likely missing something simple in aggregation but I've never had a scenario where I'm not working with fixed intervals.
I have a table the records samples at arbitrary timestamps. A sample is only recorded when it is a delta from the previous sample and the sample rate is completely irregular due to a large number of conditions. The table is very simple:
id (int)
happened_at (timestamp)
sensor_id (int)
new_value (float)
I'm trying to construct a query that will include a count of all of the samples before the happened_at of a given result row. So given an ultra simple 2 row sample data set:
id|happened_at |sensor_id| new_value
1 |2019-06-07:21:41|134679 | 123.331
2 |2019-06-07:19:00|134679 | 100.009
I'd like the result set to look like this:
happened_at |sensor_id | new_value | sample_count
2019-06-07:21:41|134679 |123.331 |2
2019-06-07:19:00|134679 |123.331 |1
I've tried:
SELECT *,
(SELECT count(sample_history.id) OVER (PARTITION BY score_history.sensor_id
ORDER BY sample_history.happened_at DESC))
FROM sensor_history
ORDER by happened_at DESC
and the duh not going to work.
(SELECT count(*)
FROM sample_history
WHERE sample_history.happened_at <= sample_timestamp)
Insights greatly appreciated.

Get rid of the SELECT (sub-query) when using the window function.
SELECT *,
count(*) OVER (PARTITION BY sensor_id ORDER BY happened_at DESC)
FROM sensor_history
ORDER BY happened_at DESC

Related

How can I rank a table in postgresql and then find the rank of a specific row?

I have a postgresql table
cubing=# SELECT * FROM times;
count | name | time
-------+---------+--------
4 | sean | 32.97
5 | Austin | 15.64
6 | Kirk | 117.02
I retrieve all from it with SELECT * FROM times ORDER BY time ASC. But now I want to give the user the option to search for a specific value (say, WHERE name = Austin) and have it tell them what rank they are in the table. Right now, I have SELECT name,time, RANK () OVER ( ORDER BY time ASC) rank_number FROM times. From how I understand it, that is giving me the rank of the entire table. I would like the rank, name, and time of who I am searching for. I am afraid if I added a where clause to my last SELECT statement with the name Austin, it would only find where the name equals Austin and rank those, rather than the rank of Austin in the rest of the table.
thanks for reading
I think the behavior you want here is to first rank your current data, then query it with some WHERE filter:
WITH cte AS (
SELECT *, RANK() OVER (ORDER BY time) rank_number
FROM times
)
SELECT count, name, time
FROM cte
WHERE name = 'Austin';
The point here is that at the time we do a query searching for Austin, the ranks for each row in your original table have already been generated.
Edit:
If you're running this query from an application, it would probably be best to avoid CTE syntax. Instead, just inline the CTE as a subquery:
SELECT count, name, time, rank_number
FROM
(
SELECT *, RANK() OVER (ORDER BY time) rank_number
FROM times
) t
WHERE name = 'Austin';

postgreSQL, first date when cummulative sum reaches mark

I have the following sample table
And the output should be the first date (for each id) when cum_rev reaches the 100 mark.
I tried the following, because I taught with group bz trick and the where condition i will only get the first occurrence of value higher than 100.
SELECT id
,pd
,cum_rev
FROM (
SELECT id
,pd
,rev
,SUM(rev) OVER (
PARTITION BY id
ORDER BY pd
) AS cum_rev
FROM tab1
)
WHERE cum_rev >= 100
GROUP BY id
But it is not working, and I get the following error. And also when I add an alias is not helping
ERROR: subquery in FROM must have an alias LINE 4: FROM (
^ HINT: For example, FROM (SELECT ...) [AS] foo.
So the desired output is:
2 2015-04-02 135.70
3 2015-07-03 102.36
Do I need another approach? Can anyone help?
Thanks
demo:db<>fiddle
SELECT
id, total
FROM (
SELECT
*,
SUM(rev) OVER (PARTITION BY id ORDER BY pd) - rev as prev_total,
SUM(rev) OVER (PARTITION BY id ORDER BY pd) as total
FROM tab1
) s
WHERE total >= 100 AND prev_total < 100
You can use the cumulative SUM() window function for each id group (partition). To find the first which goes over a threshold you need to check the previous value for being under the threshold while the current one meets it.
PS: You got the error because your subquery is missing an alias. In my example its just s

How to use new created column in where column in sql?

Hi I have a query which looks like the following :
SELECT device_id, tag_id, at, _deleted, data,
row_number() OVER (PARTITION BY device_id ORDER BY at DESC) AS row_num
FROM mdb_history.devices_tags_mapping_history
WHERE at <= '2019-04-01'
AND _deleted = False
AND (tag_id = '275674' or tag_id = '275673')
AND row_num = 1
However when I run the following query, I get the following error :
ERROR: column "row_num" does not exist
Is there any way to go about this. One way I tried was to use it in the following way:
SELECT * from (SELECT device_id, tag_id, at, _deleted, data,
row_number() OVER (PARTITION BY device_id ORDER BY at DESC) AS row_num
FROM mdb_history.devices_tags_mapping_history
WHERE at <= '2019-04-01'
AND _deleted = False
AND (tag_id = '275674' or tag_id = '275673')) tag_deleted
WHERE tag_deleted.row_num = 1
But this becomes way too complicated as I do it with other queries as I have number of join and I have to select the column as stated from so it causes alot of select statement. Any smart way of doing that in a more simpler way. Thanks
You can't refer to the row_num alias which you defined in the same level of the select in your query. So, your main option here would be to subquery, where row_num would be available. But, Postgres actually has an option to get what you want in another way. You could use DISTINCT ON here:
SELECT DISTINCT ON (device_id), device_id, tag_id, at, _deleted, data
FROM mdb_history.devices_tags_mapping_history
WHERE
at <= '2019-04-01' AND
_deleted = false AND
tag_id IN ('275674', '275673')
ORDER BY
device_id,
at DESC;
Too long/ formatted for a comment. There is a reason behind #TimBiegeleisen statement "alias which you defined in the same level of the select". That reason is that all SQL statement follow the same sequence for evaluation. Unfortunately that sequence does NOT follow the sequence of clauses within the statement presentation. that sequence is in order:
from
where
group by
having
select
limits
You will notice that what actually gets selected fall well after evaluation of the where clause. Since your alias is defined within the select phase it does not exist during the where phase.

Make postgresql timestamps unique

I have a dataset with 6M+ rows including timestamps from about 2003 to current. In 2014 the database was migrated to postgresql and the timestamp column became unique due to the higher precision of timestamps. The original ID column was not migrated. About 300k of the timestamps are repeated at least once. I want to modify the timestamp column so that they are unique by adding precision (all non-unique timestamps only go to the second).
I have this
ts message
--------------------|---------------
2014-02-01 07:40:37 | message1
2014-02-01 07:40:37 | message2
I want this
ts message
-------------------------|---------------
2014-02-01 07:40:37.0000 | message1
2014-02-01 07:40:37.0001 | message2
This should work, but it will be horribly slow I guess:
update the_table
set ts = ts + '1 millisecond'::interval * x.rn
from (
select ctid, row_number() over (order by ts) as rn
from the_table
) x
where the_table.ctid = x.ctid;
The column ctid is an internal unique identifier (actually the physical address of the row) maintained by Postgres.
You might want to add another where condition to pick out only those rows that need to be modified.
One simple solution is to try to add a random interval to the timestamp:
update t
set ts = ts + random() * interval '1000000 microsecond'
where ts = date_trunc('second', ts)
The chance of a collision is very low. If it occurs use #a_horse's answer

How to rank in postgres query

I'm trying to rank a subset of data within a table but I think I am doing something wrong. I cannot find much information about the rank() feature for postgres, maybe I'm looking in the wrong place. Either way:
I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:
select cluster_id,feed_id,pub_date,rank
from (select feed_id,pub_date,cluster_id,rank()
over (order by pub_date asc) from url_info)
as bar where cluster_id = 9876 and feed_id = 1234;
I'm modeling this after the following stackoverflow post: postgres rank
The reason I think I am doing something wrong is that there are only 39 rows in url_info that are in cluster_id 9876 and this query ran for 10 minutes and never came back. (actually re-ran it for quite a while and it returned no results, yet there is a row in cluster 9876 for id 1234) I'm expecting this will tell me something like "id 1234 was 5th for the criteria given). It will return a relative rank according to my query constraints, correct?
This is postgres 8.4 btw.
By placing the rank() function in the subselect and not specifying a PARTITION BY in the over clause or any predicate in that subselect, your query is asking to produce a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long as to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.
It appears you want to generate a rank for just the set of records selected by the where clause, in which case, all you need do is eliminate the subselect and the rank function is implicitly over the set of records matching that predicate.
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876 and feed_id = 1234;
If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:
select ranked.*
from (
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876
) as ranked
where feed_id = 1234;
Sharing another example of DENSE_RANK() of PostgreSQL.
Find top 3 students sample query.
Reference taken from this blog:
Create a table with sample data:
CREATE TABLE tbl_Students
(
StudID INT
,StudName CHARACTER VARYING
,TotalMark INT
);
INSERT INTO tbl_Students
VALUES
(1,'Anvesh',88),(2,'Neevan',78)
,(3,'Roy',90),(4,'Mahi',88)
,(5,'Maria',81),(6,'Jenny',90);
Using DENSE_RANK(), Calculate RANK of students:
;WITH cteStud AS
(
SELECT
StudName
,Totalmark
,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
FROM tbl_Students
)
SELECT
StudName
,Totalmark
,StudRank
FROM cteStud
WHERE StudRank <= 3;
The Result:
studname | totalmark | studrank
----------+-----------+----------
Roy | 90 | 1
Jenny | 90 | 1
Anvesh | 88 | 2
Mahi | 88 | 2
Maria | 81 | 3
(5 rows)