TimescaleDB query to select rows where column value changed from previous row - postgresql

Just recently started using TimescaleDB with Postgres to handle most requests for data.
However, I'm running into an issue where I have a horribly inefficient request for a time series of data.
It's a data series that can cover any length of time, with specific integer values.
Most of the time the value will be the same unless there's an anomaly, so rather than fetching 10,000+ rows of data, I would like to aggregate this into "time blocks".
Let's say there are 97 items in a row where the value is 100 (a new item every 5 minutes), then from item #98 the value is 48 for 5 items in a row, and then it goes back up to 100 for another 2,900 rows.
I don't want to fetch 3,002 items to display this data. I should only need to fetch 3 items:
1 item that says the value is 100 from a startDate
1 item that says the value is 48 from a startDate after #1
1 item that says the value is 100 again from a startDate after #2
But I'm having some trouble figuring out how I can do this with timescaledb.
Basically, if the value is the same as the last value, aggregate it. That's all I need it to do.
Does anyone know how to construct a VIEW for this kind of situation in TimescaleDB using continuous aggregation (or a faster way, if there is one) to fetch this?

You can achieve the desired result with window functions and a subselect:
SELECT time, value FROM (
    SELECT
        time,
        value,
        value - LAG(value) OVER (ORDER BY time) AS diff
    FROM hypertable) ht
WHERE diff IS NULL OR diff != 0;
You use a window function to calculate the diff to the previous row and then filter all the rows where the diff is 0 in the outer query.
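Note that a TimescaleDB continuous aggregate can't contain window functions, so if you want to reuse this, one option is to wrap it in an ordinary view. A minimal sketch, assuming the hypertable and columns from the query above:
CREATE VIEW value_changes AS
SELECT time, value FROM (
    SELECT
        time,
        value,
        -- non-zero (or NULL) only on rows where the value changed
        value - LAG(value) OVER (ORDER BY time) AS diff
    FROM hypertable) ht
WHERE diff IS NULL OR diff != 0;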

Related

how can I stream rows, in a loop, from Postgres? (F#/.NET but any other language is fine)

I have a set of data that looks like this:
(
    instrument varchar(20)                 NOT NULL,
    ts         timestamp without time zone NOT NULL,
    price      float8                      NOT NULL,
    quantity   float8                      NOT NULL,
    direction  int                         NOT NULL
);
and I'd like to keep in my app the last hour of data; so upon startup, query everything where ts >= now - 1h, and then keep a loop where I query from the last row received to 'now'.
We're talking about roughly 1.5M rows to fetch at startup.
The issue is that the timestamp is not unique: you can have multiple rows with the same timestamp.
I am requesting an update every second with a limit of 50k, and it usually produces 200-500 rows; but the startup is providing batches of 50k rows until it catches up with the new data.
Should I:
add an id, find the id of now - 1h, and request records with an id higher than the last one received?
roll back the last timestamp received by 1s and deal with duplicates
something better I didn't think about (I'm not very knowledgeable with SQL DBs)
You have the right idea.
If there is a single writer (which is to say, a single client INSERTing into this table), you can use an autoincrementing ID to avoid having to deal with duplicates. Add an ID column, query for the last hour, keep your largest-seen ID in memory, and then run your loop of id > largest_seen_id every second. This is pretty standard, but it does rely on IDs always becoming visible to the querier in increasing order, which isn't something the DB can guarantee in the general case.
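A minimal sketch of that loop, assuming a bigserial id column and the hypothetical table name my_appending_table:
-- startup: fetch the last hour, remembering the largest id seen
SELECT * FROM my_appending_table
WHERE ts >= now() - interval '1 hour'
ORDER BY id ASC;

-- every second: fetch only rows newer than the largest id seen so far
SELECT * FROM my_appending_table
WHERE id > 99106  -- hypothetical largest_seen_id from the previous batch
ORDER BY id ASC
LIMIT 50000;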
If there are multiple writers, it's possible for this approach to skip rows: a higher ID might be committed while another client still has its lower-ID row in an open transaction, invisible to your query. Your next iteration would use the higher ID as the lower bound, so you'd never see the committed lower-ID row. The easiest way of dealing with that is what you're already thinking of with your second option: query for extra rows and ignore the ones you already have. If network transfer is an issue, you can do the ignoring in the WHERE clause of the loop query, so you're only "wasting" the size of the already-seen row IDs instead of all the row data:
SELECT *
FROM my_appending_table
WHERE timestamp > ('2021-08-24 00:10:07' - interval '5 seconds')
  AND NOT id = ANY(ARRAY[99101, 99105, 99106])
ORDER BY id ASC
LIMIT 50000
Here 2021-08-24 00:10:07 is the max timestamp you currently have (it's probably fine to use the timestamp of the max-ID row; that's not guaranteed to be the same, but it basically always will be), 5 seconds is a fuzz factor to ensure all other writers have committed, and 99101, 99105, 99106 are the IDs you've already seen for rows with timestamp 2021-08-24 00:10:02 (:07 - 5) or later.

Google Sheets QUERY returns non-matching records

In this workbook, the second sheet, "DMK Recent", is populated by a query that returns two records that do not meet the [C > date'2019-12-16'] criterion. These are the last two rows on the sheet. To make things more mysterious, those records display values in the C "Date" field that do not match the source data.
If I reduce the size of the source data, only correctly selected records are returned. Is there some source data size beyond which the QUERY() function loses its grip? Many thanks for any light on this.
your query formula:
=QUERY(xFRm!A5:X,
"select A,B,C,D,F,E,K,M,G,J
where C > date'2019-12-16'
and G='THB'
order by C,D", 1)
is working flawlessly.
the culprits in your case are wrongly entered dates.
go and fix rows 4539 and 4540.

Tableau isNull then 0 calculated field

I have my Tableau workbook and I'm currently counting by a field called ID - COUNT([Id]) - and while this is great, on days with no activity my dashboard doesn't show ANYTHING, and I want it to show zero if there was no activity. So how do I change this to a count that also replaces null with 0 (zero)?
First make sure you understand what Count([ID]) does. It returns the number of records in the data source that have a non-null value in the column [ID].
Count() never evaluates to null. But if you have no rows at all in your data after filtering, then you'll get an empty result set -- i.e. view data will not have any summary data to show at all - whether null or zero.
Wrapping COUNT() in a call to ISNULL() or ZN() won't help in that case.
The solution is to make sure you have at least one data row per day, even if all other fields besides the date are null. Aggregation functions ignore nulls so padding your data like this should not disturb your results. The simplest way is to make a calendar table that has one row per day with nulls in most columns. Then use a Union to combine the calendar with your original data source. Then Count(ID) will return zero on days where there are no other records besides the calendar entry.
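A minimal sketch of that padding in SQL, assuming Postgres's generate_series and hypothetical activity table/column names (other databases would build the calendar differently):
-- calendar rows carry a date and a NULL Id; Count(Id) ignores them
CREATE VIEW padded_activity AS
SELECT Id, ActivityDate
FROM activity
UNION ALL
SELECT NULL, d::date
FROM generate_series(date '2024-01-01', date '2024-12-31', interval '1 day') AS d;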
You can also get similar results using data blending, although with a bit more complexity.

Tableau Future and Current References

Tough problem I am working on here.
I have a table of CustomerIDs and CallDates. I want to measure whether there is a 'repeat call' within a certain period of time (up to 30 days).
I plan on creating a parameter called RepeatTime which is a range from 0 - 30 days, so the user can slide a scale to see the number/percentage of total repeats.
In Excel, I have this working. I sort CustomerID in order and then sort CallDate from earliest to latest. I then have formulas like:
=IF(AND(CurrentCustomerID = FutureCustomerID, FutureCallDate - CurrentCallDate <= RepeatTime), 1,0)
CurrentCustomerID = the current row, and the FutureCustomerID = the following row (so it is saying if the customer ID is the same).
FutureCallDate = the following row and the CurrentCallDate = the current row. It is subtracting the future call time from the first call time to measure the time in between.
The goal is to be able to see, dynamically, how many customers called in for a specific reason within maybe 4 hours or 1 day or 5 days, etc. All of the way up until 30 days (this is our actual metric but it is good to see the calls which are repeats within a shorter time frame so we can investigate).
I had a similar problem; see here for a detailed version: Array calculation in Tableau, maxif routine.
Your case is basically the same thing as mine, so you could apply that solution, but I find the one I'm about to give easier to understand. I would do:
1) Create a calculated field called RepeatTime:
DATEDIFF('day', LOOKUP(MAX(CallDates), -1), MAX(CallDates))
This will calculate how many days have passed from the previous call to the current one (DATEDIFF takes the earlier date first). You can wrap it in IFNULL so you don't get Null values for each customer's first entry.
2) Drag CustomersID, CallDates and RepeatTime to the worksheet (can be on the marks tab, don't need to be on rows or column).
3) Configure the table calculation of RepeatTime: Compute using Advanced..., partitioning by CustomersID, addressing by CallDates.
Also sort by field CallDates, Maximum, Ascending.
This will guarantee the table calculation works properly.
4) Now you have a base that you can use for what you need. You can either export it to csv or mdb and connect to it.
The best approach, actually, is to have this RepeatTime field calculated outside Tableau, on your database, so it's already there when you connect to it. But this is a way to use Tableau to do the calculation for you.
Unfortunately there's no way to do this calculation directly from Tableau against your database.
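For the database-side route, a minimal sketch using a window function, assuming SQL Server-style DATEDIFF and a hypothetical Calls table with CustomerID and CallDate columns (in Postgres you would subtract the dates instead):
SELECT
    CustomerID,
    CallDate,
    -- NULL for each customer's first call; wrap in COALESCE if needed
    DATEDIFF(day,
             LAG(CallDate) OVER (PARTITION BY CustomerID ORDER BY CallDate),
             CallDate) AS RepeatTime
FROM Calls;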

SQL Sum and Group By for a running Tally?

I'm completely rewriting my question to simplify it. Sorry if you read the prior version. (The previous version of this question included a very complex query example that created a distraction from what I really need.) I'm using SQL Express.
I have a table of lessons.
LessonID  StudentID  StudentName  LengthInMinutes
1         1          Chuck        120
2         2          George        60
3         2          George        30
4         1          Chuck         60
5         1          Chuck         10
These would be ordered by date. (Of course the actual table is thousands of records with dates and other lesson-related data but this is a simplification.)
I need to query this table such that I get all rows (or a subset of rows by a date range or by student), but I need my query to add a new column we might call PriorLessonMinutes. That is, the sum of all minutes of all lessons for the same student in lessons of PRIOR dates only.
So the query would return:
LessonID  StudentID  StudentName  LengthInMinutes  PriorLessonMinutes
1         1          Chuck        120                0
2         2          George        60                0
3         2          George        30               60   (the sum of Length from row 2 only)
4         1          Chuck         60              120   (the sum of Length from row 1 only)
5         1          Chuck         10              180   (the sum of Length from rows 1 and 4)
In essence, I need a running tally of the sum of prior lesson minutes for each student. Ideally the tally shouldn't include the current row, but if it does, no big deal as I can do subtraction in the code that receives the query.
Further, (and this is important) if I retrieve only a subset of records, (for example by a date range) PriorLessonMinutes must be a sum that considers rows that are NOT returned.
My first idea was to use SUM() and to GROUP BY Student, but that isn't right because unless I'm mistaken it would include a sum of minutes for all rows for each student, including rows that come after the row which aren't relevant to the sum I need.
OPTIONS I'M REJECTING: I could scan through all rows in the code that receives the results, but that's obviously inefficient (it would force me to retrieve all rows unnecessarily). I could also put a real data field in there and populate it, but this too presents problems when other records are deleted or altered.
I have no idea how to write such a query together. Any guidance?
This is a great opportunity to use Windowed Aggregates. The trick is that you need SQL Server 2012 Express. If you can get it, then this is the query you are looking for:
select *,
    sum(LengthInMinutes)
        over (partition by StudentId order by LessonId
              rows between unbounded preceding and 1 preceding)
        as PriorLessonMinutes
from Lessons
Note that it returns NULLs instead of 0s (zeroes). If you insist on zeroes, use the COALESCE function to turn the NULLs into zeroes.
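For example, the same query with zeroes could look like:
select *,
    coalesce(
        sum(LengthInMinutes)
            over (partition by StudentId order by LessonId
                  rows between unbounded preceding and 1 preceding),
        0) as PriorLessonMinutes
from Lessons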
I suggest using a nested query to limit the number of rows returned:
select * from
(
    select *,
        sum(LengthInMinutes)
            over (partition by StudentId order by LessonId
                  rows between unbounded preceding and 1 preceding)
            as PriorLessonMinutes
    from Lessons
) as NestedLessons
where LessonId > 3 -- this is an example of a filter
This way the filter is applied after the aggregation is complete.
Now, if you want to apply a filter that doesn't affect the aggregation (like only querying data for a certain student), you should apply the filter to the inner query, as pruning the rows that don't affect the computation early (like data for other students) will improve the performance.
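For instance, querying only one student could look like this (StudentId = 1 is just an example value); the filter sits inside the nested query, so the window aggregation never touches other students' rows:
select * from
(
    select *,
        sum(LengthInMinutes)
            over (partition by StudentId order by LessonId
                  rows between unbounded preceding and 1 preceding)
            as PriorLessonMinutes
    from Lessons
    where StudentId = 1 -- filter applied before the aggregation
) as NestedLessons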
I feel the following code will serve your purpose. Check it:
select Students.StudentID, Students.First, Students.Last,
    sum(Lessons.LengthInMinutes) as TotalPriorMinutes
from Lessons
    join Students on Lessons.StudentID = Students.StudentID
where Lessons.StartDateTime < getdate()
    and StartDateTime >= '20090130 00:00:00'
    and StartDateTime < '20790101 00:00:00'
group by Students.StudentID, Students.First, Students.Last