PipelineDB: How to group stream data into N-minute intervals in a continuous view

How can I group data from a PipelineDB stream into N-minute intervals in a continuous view's SELECT?
The stream receives events from many remote hosts. I need to group these events by type, ip and time intervals of 5 minutes, for example, and count them.
So the input looks roughly like this:
time | ip | type
------------------------------------
22:35 | 111.111.111.111 | page_open <-- new interval, ends at 22:40
22:36 | 111.111.111.111 | page_open
22:37 | 111.111.111.111 | page_close
22:42 | 111.111.111.111 | page_close <-- event arrives in the next interval, ends at 22:45
22:42 | 222.111.111.111 | page_open
22:43 | 222.111.111.111 | page_open
22:44 | 222.111.111.111 | page_close
22:44 | 111.111.111.111 | page_open
And this is what the continuous view's SELECT should produce:
time | ip | type | count
---------------------------------------------
22:40 | 111.111.111.111 | page_open | 2
22:40 | 111.111.111.111 | page_close | 1
22:45 | 111.111.111.111 | page_open | 1
22:45 | 111.111.111.111 | page_close | 1
22:45 | 222.111.111.111 | page_open | 2
22:45 | 222.111.111.111 | page_close | 1
P.S.
Sorry for my English.

You can use the date_round(column, interval) [0] function for that. For example,
CREATE CONTINUOUS VIEW bucketed AS
SELECT date_round(time, '5 minutes') AS bucket, COUNT(*)
FROM input_stream GROUP BY bucket;
[0] http://docs.pipelinedb.com/builtin.html?highlight=date_round
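To also split the counts by ip and type as in the question, those columns can simply be added to the select list and the GROUP BY. A minimal sketch along the lines of the example above (the stream and column names are taken from the question; note that date_round buckets by rounding the timestamp, so the bucket labels may not be exactly the interval-end times shown in the question):
CREATE CONTINUOUS VIEW bucketed AS
SELECT date_round(time, '5 minutes') AS bucket, ip, type, COUNT(*)
FROM input_stream
GROUP BY bucket, ip, type;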

Related

SQL Server 2008 R2 - converting columns to rows and having all values in one column

I am having a hard time wrapping my head around the pivot/unpivot concepts and hope someone can help or give me some guidance on how to approach my problem.
Here is a simplified sample table I have
+-------+------+------+------+------+------+
| SAUID | COM1 | COM2 | COM3 | COM4 | COM5 |
+-------+------+------+------+------+------+
| 1 | 24 | 22 | 100 | 0 | 45 |
| 2 | 34 | 55 | 789 | 23 | 0 |
| 3 | 33 | 99 | 5552 | 35 | 4675 |
+-------+------+------+------+------+------+
The end result I am looking for is a table similar to the one below:
+-------+-----------+-------+
| SAUID | OCCUPANCY | VALUE |
+-------+-----------+-------+
| 1 | COM1 | 24 |
| 1 | COM2 | 22 |
| 1 | COM3 | 100 |
| 1 | COM4 | 0 |
| 1 | COM5 | 45 |
| 2 | COM1 | 34 |
| 2 | COM2 | 55 |
| 2 | COM3 | 789 |
| 2 | COM4 | 23 |
| 2 | COM5 | 0 |
| 3 | COM1 | 33 |
| 3 | COM2 | 99 |
| 3 | COM3 | 5552 |
| 3 | COM4 | 35 |
| 3 | COM5 | 4675 |
+-------+-----------+-------+
I'm looking around, but most of the examples seem to use PIVOT, and I'm having a hard time applying that to my case because I need all the values in one column.
I was hoping to experiment with some hard-coding to get familiar with my example, but my actual tables have ~100 columns with varying numbers of SAUID per table, so it looks like it will require dynamic SQL?
Thanks for the help in advance.
Use UNPIVOT:
SELECT u.SAUID, u.OCCUPANCY, u.VALUE
FROM yourTable t
UNPIVOT
(
    VALUE FOR OCCUPANCY IN (COM1, COM2, COM3, COM4, COM5)
) u
ORDER BY u.SAUID, u.OCCUPANCY;
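Since the real table has ~100 COM columns, you would probably not want to type the IN list by hand. Here is a rough sketch of a dynamic-SQL variant (not part of the original answer; the table name dbo.yourTable and the COM% naming pattern are assumptions based on the sample), using STUFF/FOR XML PATH because STRING_AGG is not available on SQL Server 2008 R2:
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- build a comma-separated, quoted list of all COM* columns
SELECT @cols = STUFF((
    SELECT ', ' + QUOTENAME(c.name)
    FROM sys.columns c
    WHERE c.object_id = OBJECT_ID('dbo.yourTable')
      AND c.name LIKE 'COM%'
    ORDER BY c.column_id
    FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, '');

-- plug the column list into the same UNPIVOT query as above
SET @sql = N'SELECT u.SAUID, u.OCCUPANCY, u.VALUE
FROM dbo.yourTable t
UNPIVOT (VALUE FOR OCCUPANCY IN (' + @cols + N')) u
ORDER BY u.SAUID, u.OCCUPANCY;';

EXEC sp_executesql @sql;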

Aggregate all previous rows for a specific time difference

I have a Spark DataFrame with the following entries:
| order id | time | amt |
| 1 | 2017-10-01 12:00 | 100 |
| 2 | 2017-10-01 15:00 | 100 |
| 3 | 2017-10-01 17:00 | 100 |
| 4 | 2017-10-02 16:00 | 100 |
| 5 | 2017-10-02 23:00 | 100 |
I want to add a column amount_prev_24h that has, for each order id, the sum of amt for all orders in the last 24 hours.
| order id | time | amt | amt_24h
| 1 | 2017-10-01 12:00 | 100 | 0
| 2 | 2017-10-01 15:00 | 100 | 100
| 3 | 2017-10-01 17:00 | 100 | 200
| 4 | 2017-10-02 16:00 | 100 | 100
| 5 | 2017-10-02 23:00 | 100 | 100
How would I go about doing it?
This is PySpark code; the Scala API is similar.
from pyspark.sql import Window
from pyspark.sql.functions import unix_timestamp, sum  # note: shadows Python's built-in sum
df = df.withColumn('time_uts', unix_timestamp('time', format='yyyy-MM-dd HH:mm'))
df = df.withColumn('amt_24h', sum('amt').over(Window.orderBy('time_uts').rangeBetween(-24 * 3600, -1))).fillna(0, subset='amt_24h')
I hope this helps.
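For reference, the same window can also be written in Spark SQL. This is an equivalent sketch rather than part of the original answer; it assumes the DataFrame has been registered as a temporary view named orders via df.createOrReplaceTempView('orders'):
SELECT *,
       COALESCE(SUM(amt) OVER (
           ORDER BY unix_timestamp(time, 'yyyy-MM-dd HH:mm')
           RANGE BETWEEN 86400 PRECEDING AND 1 PRECEDING
       ), 0) AS amt_24h
FROM orders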

PostgreSQL Crosstab generate_series of weeks for columns

From a table of "time entries" I'm trying to create a report of weekly totals for each user.
Sample of the table:
+-----+---------+-------------------------+--------------+
| id | user_id | start_time | hours_worked |
+-----+---------+-------------------------+--------------+
| 997 | 6 | 2018-01-01 03:05:00 UTC | 1.0 |
| 996 | 6 | 2017-12-01 05:05:00 UTC | 1.0 |
| 998 | 6 | 2017-12-01 05:05:00 UTC | 1.5 |
| 999 | 20 | 2017-11-15 19:00:00 UTC | 1.0 |
| 995 | 6 | 2017-11-11 20:47:42 UTC | 0.04 |
+-----+---------+-------------------------+--------------+
Right now I can run the following and basically get what I need
SELECT COALESCE(SUM(time_entries.hours_worked),0) AS total,
time_entries.user_id,
week::date
--Using generate_series here to account for weeks with no time entries when
--doing the join
FROM generate_series( (DATE_TRUNC('week', '2017-11-01 00:00:00'::date)),
(DATE_TRUNC('week', '2017-12-31 23:59:59.999999'::date)),
interval '7 day') as week LEFT JOIN time_entries
ON DATE_TRUNC('week', time_entries.start_time) = week
GROUP BY week, time_entries.user_id
ORDER BY week
This will return
+-------+---------+------------+
| total | user_id | week |
+-------+---------+------------+
| 14.08 | 5 | 2017-10-30 |
| 21.92 | 6 | 2017-10-30 |
| 10.92 | 7 | 2017-10-30 |
| 14.26 | 8 | 2017-10-30 |
| 14.78 | 10 | 2017-10-30 |
| 14.08 | 13 | 2017-10-30 |
| 15.83 | 15 | 2017-10-30 |
| 8.75 | 5 | 2017-11-06 |
| 10.53 | 6 | 2017-11-06 |
| 13.73 | 7 | 2017-11-06 |
| 14.26 | 8 | 2017-11-06 |
| 19.45 | 10 | 2017-11-06 |
| 15.95 | 13 | 2017-11-06 |
| 14.16 | 15 | 2017-11-06 |
| 1.00 | 20 | 2017-11-13 |
| 0 | | 2017-11-20 |
| 2.50 | 6 | 2017-11-27 |
| 0 | | 2017-12-04 |
| 0 | | 2017-12-11 |
| 0 | | 2017-12-18 |
| 0 | | 2017-12-25 |
+-------+---------+------------+
However, this is difficult to parse, particularly when there's no data for a week. What I would like is a pivot or crosstab table where the weeks are the columns and the rows are the users, and which includes nulls/zeroes for missing combinations (for instance, when a user had no entries in a week, or a week had no entries from any user).
Something like this
+---------+---------------+--------------+--------------+
| user_id | 2017-10-30 | 2017-11-06 | 2017-11-13 |
+---------+---------------+--------------+--------------+
| 6 | 4.0 | 1.0 | 0 |
| 7 | 4.0 | 1.0 | 0 |
| 8 | 4.0 | 0 | 0 |
| 9 | 0 | 1.0 | 0 |
| 10 | 4.0 | 0.04 | 0 |
+---------+---------------+--------------+--------------+
I've been looking around online, and it seems that "dynamically" generating a list of columns for crosstab is difficult. I'd rather not hard-code them, which seems weird to do anyway for dates. Or use something like this case with week numbers.
Should I look for another solution besides crosstab? If I could get the series of weeks for each user, including all the nulls, I think that would be good enough. It just seems that right now my join strategy isn't returning that.
Personally, I would use a Date Dimension table and use that table as the basis for the query. I find it far easier to use tabular data for these types of calculations, as it leads to SQL that's easier to read and maintain. There's a great article on creating a Date Dimension table in PostgreSQL at https://medium.com/@duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac, though you could get away with a much simpler version of that table.
Ultimately, what you would do is use the date table as the base for the SELECT cols FROM table section, then join against it (probably using Common Table Expressions) to create the calculations.
I'll write up a solution demonstrating how you could create such a query if you would like.
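For example, here is a minimal sketch of that idea, using a generate_series CTE in place of a full date dimension table and cross joining it with the distinct users so that every user/week pair exists (table and column names follow the question; the date range is the one used above):
WITH weeks AS (
    SELECT generate_series(
               date_trunc('week', DATE '2017-11-01'),
               date_trunc('week', DATE '2017-12-31'),
               interval '7 days')::date AS week
),
users AS (
    SELECT DISTINCT user_id FROM time_entries
)
SELECT u.user_id,
       w.week,
       COALESCE(SUM(te.hours_worked), 0) AS total
FROM users u
CROSS JOIN weeks w
LEFT JOIN time_entries te
       ON te.user_id = u.user_id
      AND date_trunc('week', te.start_time)::date = w.week
GROUP BY u.user_id, w.week
ORDER BY u.user_id, w.week;
This returns one row per user per week, with 0 for weeks without entries, which can then be pivoted into columns in the reporting layer or with crosstab if a fixed column list is acceptable.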

Crosstab function and Dates PostgreSQL

I have to create a crosstab table from a query in which dates become column names. The number of order-date columns can increase or decrease depending on the dates passed to the query. The order date is stored in Unix (epoch) format and is converted to a normal date.
The query is as follows:
Select cd.cust_id
, od.order_id
, od.order_size
, (TIMESTAMP 'epoch' + od.order_date * INTERVAL '1 second')::Date As order_date
From consumer_details cd,
consumer_order od
Where cd.cust_id = od.cust_id
And od.order_date Between 1469212200 And 1469212600
Order By od.order_id, od.order_date
The query returns a table like this:
cust_id | order_id | order_size | order_date
-----------|----------------|---------------|--------------
210721008 | 0437756 | 4323 | 2016-07-22
210721008 | 0437756 | 4586 | 2016-09-24
210721019 | 10749881 | 0 | 2016-07-28
210721019 | 10749881 | 0 | 2016-07-28
210721033 | 13639 | 2286145 | 2016-09-06
210721033 | 13639 | 2300040 | 2016-10-03
The desired result is:
cust_id | order_id | 2016-07-22 | 2016-09-24 | 2016-07-28 | 2016-09-06 | 2016-10-03
-----------|----------------|---------------|---------------|---------------|---------------|---------------
210721008 | 0437756 | 4323 | 4586 | | |
210721019 | 10749881 | | | 0 | |
210721033 | 13639 | | | | 2286145 | 2300040
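One possible approach, sketched here as an illustration rather than a definitive answer: tablefunc's crosstab(source_sql, category_sql) can pivot the dates into columns, but the output column list in the AS clause must be fixed at call time, so for a changing date range you would have to build the statement dynamically (in a DO block or in application code). The column types below are assumptions based on the sample data, and crosstab keys rows on the first column only, carrying order_id along as an "extra" column, which works for the sample where each cust_id has a single order_id:
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
    $$ SELECT cd.cust_id,
              od.order_id,
              (TIMESTAMP 'epoch' + od.order_date * INTERVAL '1 second')::date AS order_date,
              od.order_size
       FROM consumer_details cd
       JOIN consumer_order od ON cd.cust_id = od.cust_id
       WHERE od.order_date BETWEEN 1469212200 AND 1469212600  -- epoch bounds from the question; widen to cover the desired range
       ORDER BY 1, 2, 3 $$,
    $$ SELECT DISTINCT (TIMESTAMP 'epoch' + od.order_date * INTERVAL '1 second')::date
       FROM consumer_order od
       WHERE od.order_date BETWEEN 1469212200 AND 1469212600
       ORDER BY 1 $$
) AS ct (cust_id      bigint,
         order_id     text,
         "2016-07-22" numeric,
         "2016-07-28" numeric,
         "2016-09-06" numeric,
         "2016-09-24" numeric,
         "2016-10-03" numeric);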

PostgreSQL: Use count on multiple columns

I have two tables. The first provides the condition for counting records in the second. The two tables are linked 1:1 by timestamp.
The problem is that the second table has many columns, and we need a count for each column that matches the condition in the first table.
Example:
Tables met and pot
CREATE TABLE met (
    tstamp timestamp without time zone NOT NULL,
    h1_rad double precision,
    CONSTRAINT met_pkey PRIMARY KEY (tstamp)
);
CREATE TABLE pot (
    tstamp timestamp without time zone NOT NULL,
    c1 double precision,
    c2 double precision,
    c3 double precision,
    CONSTRAINT pot_pkey PRIMARY KEY (tstamp)
);
In reality, pot has 108 columns, from c1 to c108.
Table values:
+ Table met + + Table pot +
+----------------+--------+--+----------------+------+------+------+
| tstamp | h1_rad | | tstamp | c1 | c2 | c3 |
+----------------+--------+--+----------------+------+------+------+
| 20150101 00:00 | 0 | | 20150101 00:00 | 5,5 | 3,3 | 15,6 |
| 20150101 00:05 | 1,8 | | 20150101 00:05 | 12,8 | 15,8 | 1,5 |
| 20150101 00:10 | 15,4 | | 20150101 00:10 | 25,4 | 4,5 | 1,4 |
| 20150101 00:15 | 28,4 | | 20150101 00:15 | 18,3 | 63,5 | 12,5 |
| 20150101 00:20 | 29,4 | | 20150101 00:20 | 24,5 | 78 | 17,5 |
| 20150101 00:25 | 13,5 | | 20150101 00:25 | 12,8 | 5,4 | 18,4 |
| 20150102 00:00 | 19,5 | | 20150102 00:00 | 11,1 | 25,6 | 6,5 |
| 20150102 00:05 | 2,5 | | 20150102 00:05 | 36,5 | 21,4 | 45,2 |
| 20150102 00:10 | 18,4 | | 20150102 00:10 | 1,4 | 35,5 | 63,5 |
| 20150102 00:15 | 20,4 | | 20150102 00:15 | 18,4 | 23,4 | 8,4 |
| 20150102 00:20 | 6,8 | | 20150102 00:20 | 16,8 | 12,5 | 18,4 |
| 20150102 00:25 | 17,4 | | 20150102 00:25 | 25,8 | 23,5 | 9,5 |
+----------------+--------+--+----------------+------+------+------+
What I need is, grouped by day, the number of rows in pot where the value is higher than 15 and the value in met for the same timestamp is also higher than 15.
With the data supplied, we need something like:
+----------+----+----+----+
| day | c1 | c2 | c3 |
+----------+----+----+----+
| 20150101 | 3 | 2 | 1 |
| 20150102 | 2 | 4 | 1 |
+----------+----+----+----+
How can I get this?
Is this possible with a single query, even with subqueries?
Actually, the raw data is stored every minute in other tables. The tables met and pot are summarized and filtered tables, for performance.
If necessary, I can create tables with data summarized by day if that simplifies the solution.
Thanks
P.S.
Sorry for my English
You can solve this with some CASE statements. Test for both conditions, and if true return a 1. Then SUM() the results using a GROUP BY on the timestamp converted to a date to get your total:
SELECT
date(met.tstamp),
SUM(CASE WHEN met.h1_rad > 15 AND pot.c1 > 15 THEN 1 END) as C1,
SUM(CASE WHEN met.h1_rad > 15 AND pot.c2 > 15 THEN 1 END) as C2,
SUM(CASE WHEN met.h1_rad > 15 AND pot.c3 > 15 THEN 1 END) as C3
FROM
met INNER JOIN pot ON met.tstamp = pot.tstamp
GROUP BY date(met.tstamp)
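If the PostgreSQL version is 9.4 or newer, the same result can also be written with the aggregate FILTER clause. This is just an equivalent variant of the query above (not part of the original answer), with the small difference that COUNT returns 0 instead of NULL for days where nothing matches:
SELECT
    date(met.tstamp) AS day,
    COUNT(*) FILTER (WHERE met.h1_rad > 15 AND pot.c1 > 15) AS c1,
    COUNT(*) FILTER (WHERE met.h1_rad > 15 AND pot.c2 > 15) AS c2,
    COUNT(*) FILTER (WHERE met.h1_rad > 15 AND pot.c3 > 15) AS c3
FROM met
INNER JOIN pot ON met.tstamp = pot.tstamp
GROUP BY date(met.tstamp);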