Rolling up events to a minute in Apache Beam

I have a stream of data with this structure (apologies that it's in SQL):
CREATE TABLE github_events
(
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
user_id bigint,
org jsonb,
created_at timestamp
);
In SQL, I would roll this data up to the minute like this:
1. Create a roll-up table for this purpose:
CREATE TABLE github_events_rollup_minute
(
created_at timestamp,
event_count bigint
);
2. And populate it with INSERT/SELECT:
INSERT INTO github_events_rollup_minute(
created_at,
event_count
)
SELECT
date_trunc('minute', created_at) AS created_at,
COUNT(*) AS event_count
FROM github_events
GROUP BY 1;
In Apache Beam, I am trying to roll events up to the minute, i.e. count the total number of events received in each minute according to the event's timestamp field:
Timestamp (in YYYY-MM-DDThh:mm): event_count
If more events for an already counted minute arrive later in the pipeline (events can be delayed, for example because the customer was offline), we just need to take the new roll-up count and add it to the existing count for that timestamp.
This will allow us to simply increment the count for YYYY-MM-DDThh:mm by event_count in the application.
Assume that events may be delayed but will always have the timestamp field.
I would like to accomplish the same thing in Apache Beam. I am very new to Apache Beam and feel that I am missing something that would allow me to do this, even though I've read the Apache Beam Programming Guide multiple times.

Take a look at the sections on Windowing and Triggers in the programming guide. What you're describing is fixed-time windows with allowed lateness for late data. The general shape of the pipeline is:
1. Read the input github_events data
2. Window into fixed windows of 1 minute, allowing late data
3. Count events per window
4. Output the result to github_events_rollup_minute
The WindowedWordCount example project demonstrates this pattern.
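To make the arithmetic behind those steps concrete, here is a minimal plain-Python sketch. It does not use the Beam API: it only mimics what a one-minute fixed window keyed on event time computes, and how a late batch increments an existing count. The sample events and the names minute_key and count_per_minute are made up for illustration, and the late batch is assumed to contain only the newly arrived events.

```python
from collections import Counter
from datetime import datetime


def minute_key(created_at: datetime) -> str:
    """Truncate a timestamp to its minute, like date_trunc('minute', ...)."""
    return created_at.strftime("%Y-%m-%dT%H:%M")


def count_per_minute(events):
    """Count events per one-minute window, keyed by event time."""
    return Counter(minute_key(e["created_at"]) for e in events)


# Hypothetical sample data: one on-time batch and one late batch.
on_time = [
    {"event_id": 1, "created_at": datetime(2020, 1, 1, 12, 0, 5)},
    {"event_id": 2, "created_at": datetime(2020, 1, 1, 12, 0, 40)},
    {"event_id": 3, "created_at": datetime(2020, 1, 1, 12, 1, 10)},
]
late = [
    {"event_id": 4, "created_at": datetime(2020, 1, 1, 12, 0, 59)},
]

# The application-side roll-up: each batch of per-minute counts is added
# to whatever has already been recorded for that minute.
rollup = Counter()
rollup.update(count_per_minute(on_time))  # first result per window
rollup.update(count_per_minute(late))     # late result increments 12:00
```

After both updates the 12:00 bucket holds three events and the 12:01 bucket holds one, which is the "increment by event_count" behaviour described in the question.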

KSQL UNIX_TIMESTAMP function is not dynamic on streams created with queries

I'm creating the following stream:
CREATE STREAM riderLocations (profileId VARCHAR, latitude DOUBLE, longitude DOUBLE, publishtime VARCHAR)
WITH (kafka_topic='locations', value_format='json', partitions=1);
and then this other one:
CREATE STREAM INVENTORY WITH (KAFKA_TOPIC='locations_in')
AS select * FROM riderLocations
where STRINGTOTIMESTAMP(publishtime, 'yyyy-MM-dd HH:mm:ss.SSSZ') < UNIX_TIMESTAMP();
When I execute the command: select * from inventory emit changes;
it only shows the messages whose publish time was earlier than the moment the inventory stream was created.
How can I force the UNIX_TIMESTAMP value to update so that my inventory stream keeps updating?
The timestamp is evaluated at query creation time, not runtime.
If you want only messages older than "now", you'll need to include that condition in the WHERE clause of each SELECT query you run.

Update all time index in a TimescaleDB/PostgreSQL hypertable?

I am using an open-source time-series database named TimescaleDB (based on PostgreSQL).
Assuming this table:
CREATE TABLE order (
time TIMESTAMPTZ NOT NULL,
product text,
price DOUBLE PRECISION,
qty DOUBLE PRECISION
);
Next, I transform it into a hypertable with:
SELECT create_hypertable('order', 'time');
Next, I insert some data (more than 5 million rows):
2020-01-01T12:23:52.1235,product1,10,1
2020-01-01T12:23:53.5496,product2,52,7
2020-01-01T12:23:55.3512,product1,23,5
[...]
I then need to update the data so that the time index is shifted back by a 1h interval, like this:
2020-01-01T11:23:52.1235,product1,10,1
2020-01-01T11:23:53.5496,product2,52,7
2020-01-01T11:23:55.3512,product1,23,5
[...]
What is the most efficient method (in terms of duration) to alter the time index in this hypertable so that 1h is subtracted from every row in the order table?
I'm not sure whether partitioning is available in Timescale; it would ease the process by letting you partition on a time range or even a date range.
Check whether that is one of the available options, because then you can simply drop a partition based on a range and voila!

Handling of multiple queries as one result

Let's say I have this table
CREATE TABLE device_data_by_year (
year int,
device_id uuid,
sensor_id uuid,
nano_since_epoch bigint,
unit text,
value double,
source text,
username text,
PRIMARY KEY (year, device_id, nano_since_epoch,sensor_id)
) WITH CLUSTERING ORDER BY (device_id desc, nano_since_epoch desc);
I need to query data for a particular device and sensor in a period between 2017 and 2018. In this case 2 queries will be issued:
select * from device_data_by_year where year = 2017 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
select * from device_data_by_year where year = 2018 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
Currently I iterate over the resultsets and build a List with all the results. I am aware that this could (and will) run into OOM problems some day. Is there a better approach, how to handle / merge query results into one set?
Thanks
You can use IN to specify a list of years, but this is not an optimal solution: because the year field is the partition key, the data will most probably live on different machines, so one node will act as "coordinator" and will need to ask the other machines for results and aggregate the data. From a performance point of view, it could be faster to issue 2 async requests in parallel and then do the merge on the client side.
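A rough sketch of the "parallel requests, merge on the client" idea follows. It is written in plain Python with a made-up fetch_year function standing in for the real per-year query, so none of it is driver API. The point it illustrates is that the per-year result streams can be merged lazily instead of being collected into one big list, assuming each stream arrives newest first as the clustering order suggests and that the driver hands rows back lazily.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def fetch_year(year):
    """Hypothetical stand-in for one per-partition query.

    A real implementation would run the SELECT for `year` and return the
    driver's result iterator. Rows here are (year, nano_since_epoch)
    tuples, newest first.
    """
    sample = {
        2017: [(2017, 300), (2017, 100)],
        2018: [(2018, 900), (2018, 700)],
    }
    return iter(sample[year])


def query_years(years):
    """Issue one query per year in parallel and merge the results lazily."""
    with ThreadPoolExecutor(max_workers=len(years)) as pool:
        streams = list(pool.map(fetch_year, years))
    # heapq.merge pulls one row at a time from each stream, so the combined
    # result is never materialised as a single list.
    return heapq.merge(*streams, key=lambda row: row[1], reverse=True)


# Usage: consume the merged rows one by one, newest first.
merged_nanos = [row[1] for row in query_years([2017, 2018])]
```

Because the yearly partitions do not overlap in time, simply chaining the 2018 stream before the 2017 stream would give the same order; the merge is shown because it also works when the streams interleave.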
P.S. Your data model has quite serious problems. You partition by year, which means:
Data isn't distributed well across the cluster: only N=RF machines will hold the data;
These partitions will be huge, even if you have only a hundred devices reporting one measurement per minute;
Only one partition will be "hot": it will receive all data during the year, and the other partitions won't be used very often.
You can use months, or even days, as the partition key to decrease the partition size, but that still won't solve the problem of "hot" partitions.
If I remember correctly, the Data Modelling course at DataStax Academy has an example of a data model for a sensor network.
Changed the table structure to:
CREATE TABLE device_data (
week_first_day timestamp,
device_id uuid,
sensor_id uuid,
nano_since_epoch bigint,
unit text,
value double,
source text,
username text,
PRIMARY KEY ((week_first_day, device_id), nano_since_epoch, sensor_id)
) WITH CLUSTERING ORDER BY (nano_since_epoch desc, sensor_id desc);
according to @AlexOtt's proposal. Some changes to the application logic are required: for example, findAllByYear now needs to iterate over single weeks.
Coming back to the original question: would you rather send 52 queries (getDataByYear, one query per week) or would you use the IN operator here?
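For the week iteration mentioned above, a small sketch of how the application could enumerate the week_first_day values for a time range is shown below. The function names are hypothetical, and the sketch assumes that weeks start on Monday at midnight in the timestamp's own time zone; the table definition does not say which convention is actually used.

```python
from datetime import datetime, timedelta, timezone


def week_first_day(ts: datetime) -> datetime:
    """Return midnight of the Monday that starts the week containing ts."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return day - timedelta(days=day.weekday())


def week_buckets(start: datetime, end: datetime):
    """Yield the week_first_day of every week overlapping [start, end]."""
    current = week_first_day(start)
    while current <= end:
        yield current
        current += timedelta(weeks=1)


# Usage: one query per yielded bucket, each bound to (week_first_day, device_id).
start = datetime(2018, 1, 10, 15, 30, tzinfo=timezone.utc)
end = datetime(2018, 1, 24, tzinfo=timezone.utc)
buckets = [b.date().isoformat() for b in week_buckets(start, end)]
```

Each bucket corresponds to one partition, so the number of queries for a range equals the number of weeks it touches.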

Looping through unique dates in PostgreSQL

In Python (pandas) I read from my database and then I use a pivot table to aggregate data each day. The raw data I am working on is about 2 million rows per day and it is per person and per 30 minutes. I am aggregating it to be daily instead so it is a lot smaller for visualization.
So in pandas, I would read each date into memory and aggregate it and then load it into a fresh table in postgres.
How can I do this directly in postgres? Can I loop through each unique report_date in my table, groupby, and then append it to another table? I am assuming doing it in postgres would be fast compared to reading it over a network in python, writing a temporary .csv file, and then writing it again over the network.
Here's an example: Suppose that you have a table
CREATE TABLE post (
posted_at timestamptz not null,
user_id integer not null,
score integer not null
);
representing the scores various users have earned from posts they made in an SO-like forum. Then the following query
SELECT user_id, posted_at::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, posted_at::date;
will aggregate the scores per user per day.
Note that the day boundary here follows the session's TimeZone setting, so the day changes at 00:00 UTC (like SO does) only when the session runs in UTC. If you want a different time, say midnight Paris time, then you can do it like so:
SELECT user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date;
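To see why the choice of time zone changes the grouping, here is a small Python illustration of the same idea outside the database. It assumes a fixed UTC+1 offset as a stand-in for Europe/Paris in winter (a real time zone database would also handle daylight saving time), and the helper name day_in_zone is made up.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+1 offset stands in for Europe/Paris in winter.
PARIS_WINTER = timezone(timedelta(hours=1))


def day_in_zone(posted_at: datetime, tz: timezone):
    """Return the calendar date of an aware timestamp as seen in tz."""
    return posted_at.astimezone(tz).date()


# A post made late in the evening, UTC.
posted_at = datetime(2020, 1, 1, 23, 30, tzinfo=timezone.utc)
utc_day = day_in_zone(posted_at, timezone.utc)
paris_day = day_in_zone(posted_at, PARIS_WINTER)
```

The same post counts toward 1 January when days are cut at UTC midnight and toward 2 January when they are cut at Paris midnight, so the two queries can put it in different daily totals.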
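To get good performance for the above queries, you might want to create an expression index on the grouping columns, for example on (user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date) for the second query. PostgreSQL only accepts immutable expressions in an index, and a plain posted_at::date on a timestamptz column depends on the session time zone, so the first query would need an explicit zone as well, such as (posted_at AT TIME ZONE 'UTC')::date, in both the query and the index.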

Join time series Cassandra tables in Spark

I have two tables (agg_count_1 and agg_count_2) in Cassandra both with the same schema:
CREATE TABLE agg_count_1 (
pk_1 text,
pk_2 text,
pk_3 text,
window_start timestamp,
count counter,
PRIMARY KEY (( pk_1, pk_2, pk_3 ), window_start)
) WITH CLUSTERING ORDER BY ( window_start DESC )
window_start is a timestamp rounded to the nearest 15 minutes, which means its value is exactly the same in both tables; however, rows for some time windows may be missing.
I would like to efficiently (inner) join these two tables on the primary key into a third table with very much the same schema, storing the value of agg_count_1.count in counter_1 and agg_count_2.count in counter_2:
CREATE TABLE agg_joined (
pk_1 text,
pk_2 text,
pk_3 text,
window_start timestamp,
counter_1 int,
counter_2 int,
PRIMARY KEY (( pk_1, pk_2, pk_3 ), window_start)
) WITH CLUSTERING ORDER BY ( window_start DESC )
This can be done in many ways using a combination of Scala, Spark and Spark-Cassandra connector features. What is the recommended way?
I would also appreciate hearing about solutions to avoid. Joins are expensive in general, but I would expect this kind of "zipping" of time series to be fairly efficient if you (actually me) don't do anything wrong.
Based on the Spark-Cassandra documentation, using joinWithCassandraTable sounds suboptimal because it executes a single query for every partition:
joinWithCassandraTable utilizes the java driver to execute a single query for every partition required by the source RDD so no un-needed data will be requested or serialized.