How can I select only data within a specific window in KSQL? - apache-kafka

I have a table with a tumbling window, e.g.
CREATE TABLE total_transactions_per_1_days AS
  SELECT
    sender,
    count(*) AS count,
    sum(amount) AS total_amount,
    histogram(recipient) AS recipients
  FROM completed_transactions
  WINDOW TUMBLING (SIZE 1 DAYS)
  GROUP BY sender;
Now I need to select only the data from the current window, i.e. windowstart <= current time and windowend >= current time. Is that possible? I could not find any example.

Depends what you mean when you say 'select data' ;)
ksqlDB supports two main query types, pull and push (see https://docs.ksqldb.io/en/latest/concepts/queries/).
If what you want is a pull query, i.e. a traditional SQL query that pulls back the current window as a one-time result, then what you want may be possible, though pull queries are a recent feature and not fully featured yet. As of version 0.10 you can only look up a known key. For example, if sender is the key of the table, you could run a query like:
SELECT * FROM total_transactions_per_1_days
WHERE sender = some_value
AND WindowStart <= UNIX_TIMESTAMP()
AND WindowEnd >= UNIX_TIMESTAMP();
This requires the table to have processed data with a timestamp close to the current wall-clock time in order to return anything, i.e. if the system was lagging, or if you were processing historic or delayed data, this would not work.
Note: the above query will work on ksqlDB v0.10. Your success on older versions may vary.
There are plans to extend the functionality of pull queries, so keep an eye out for updates to ksqlDB.
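If instead what you want is a push query, the other query type mentioned above, you can subscribe to the table's ongoing changes and watch the current window's totals update as events arrive. A minimal sketch (behaviour not verified against a specific ksqlDB version):
-- Emit every change to the windowed table as it happens; for windowed sources each
-- row carries its window bounds, so you can pick out the current window downstream.
SELECT *
FROM total_transactions_per_1_days
EMIT CHANGES;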

Related

Union with Redshift native tables and external tables (Spectrum)

If I have a view that contains a union between a native table and external table like so (pseudocode):
create view vwPageViews as
select from PageViews
union all
select from PageViewsHistory
PageViews holds data for the last 2 years. The external table holds data older than 2 years.
If a user selects from the view with a filter for the last 6 months, how does Redshift Spectrum handle it? Does it read the entire external table even though none of its rows will be returned (and accordingly cost us money for all of it)? (Assume the S3 files are Parquet based.)
ex.
Select * from vwPageViews where MyDate >= '01/01/2021'
What's the best approach for querying both current and historical data using RS and Spectrum? Thanks!
How this will behave on Spectrum depends on whether or not you have partitioned the data in S3. Without partitions (and a WHERE clause on the partition column), the Spectrum engine has to read every file to determine whether any of them contain the needed data. The cost of this depends on the number and size of the files AND on their format (CSV is more expensive to scan than Parquet, for example).
The way around this is to partition the data in S3 and to have a WHERE clause on the partition value. This excludes files from being read when they don't match on the partition value.
The rub is in providing the WHERE clause for the partition, as the partition will likely be less granular than the date or timestamp you are using in your base data. For example, if you partition on YearMonth (YYYYMM) and want a day-level filter, you will need two parts to the WHERE clause - WHERE date_col >= '2015-07-12' AND part_col >= 201507. How you produce both conditions will depend on your solution around Redshift.
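For illustration, a hedged sketch of what that can look like (the external schema, table, column and bucket names are made up; adjust to your setup):
-- External table partitioned by YearMonth; each partition maps to an S3 prefix.
CREATE EXTERNAL TABLE spectrum.pageviews_history (
    mydate  date,
    page    varchar(500),
    views   int
)
PARTITIONED BY (yearmonth int)
STORED AS PARQUET
LOCATION 's3://my-bucket/pageviews_history/';

-- Register each partition (or automate this as new months arrive).
ALTER TABLE spectrum.pageviews_history
ADD IF NOT EXISTS PARTITION (yearmonth = 201507)
LOCATION 's3://my-bucket/pageviews_history/yearmonth=201507/';

-- Day-level filter plus the coarser partition filter, so Spectrum can prune files.
SELECT page, sum(views)
FROM spectrum.pageviews_history
WHERE mydate >= '2015-07-12'
  AND yearmonth >= 201507
GROUP BY page;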

PostgreSQL: How to do a SELECT clause with a condition iterating through a range of values?

Hi everyone. This is my first post on Stack Overflow, so sorry if it is clumsy in any way.
I work in Python and make PostgreSQL-style SQL requests to a Google BigQuery database. The data structure looks like this:
sample of data
where time is represented in nanoseconds and is not regularly spaced (it is captured in real time).
What I want to do is to select, say, the mean price over a minute, for each minute in a time range that I would like to give as a parameter.
This time range is currently a list of timestamps that I build externally, and I make sure they are separated by one minute each:
[1606170420000000000, 1606170360000000000, 1606170300000000000, 1606170240000000000, 1606170180000000000, ...]
My question is: how can I extract this list of mean prices given that list of time intervals?
Ideally I'd expect something like
SELECT AVG(price) OVER( PARTITION BY (time BETWEEN time_intervals[i] AND time_intervals[i+1] for i in range(len(time_intervals))) )
FROM table_name
but I know that doesn't make sense...
My temporary solution is to aggregate many SELECT ... UNION DISTINCT clauses, one for each minute interval. But as you can imagine, this is not very efficient... (I need up to 60*24 = 1440 samples)
There may very well already be an answer to that question, but since I'm not even sure how to formulate it, I have found nothing yet. Any link and/or tip would be of great help.
Many thanks in advance.
First of all, your sample data appears to be at nanosecond resolution, and you are looking for averages at minute (sixty-second) resolution.
Please try this:
select div(time, 60000000000) as minute,
       pair,
       avg(price) as avg_price
from your_table
group by minute, pair
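If it reads more clearly, the same bucketing can be written with BigQuery's timestamp functions. This is just a sketch, assuming time really is nanoseconds since the epoch and reusing the table/column names above:
select timestamp_trunc(timestamp_micros(div(time, 1000)), MINUTE) as minute,
       pair,
       avg(price) as avg_price
from your_table
group by minute, pair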
If you want to control the intervals as you said in your comment, then please try something like this (I do not have access to BigQuery):
with time_ivals as (
select tick,
lead(tick) over (order by tick) as next_tick
from unnest(
[1606170420000000000, 1606170360000000000,
1606170300000000000, 1606170240000000000,
1606170180000000000, ...]) as tick
)
select t.tick, y.pair, avg(y.price) as avg_price
from time_ivals t
join your_table y
on y.time >= t.tick
and y.time < t.next_tick
group by t.tick, y.pair;

Logical Grouping in BigQuery

I am trying to group data in BigQuery in a way that goes beyond simple aggregation. However, I am not sure whether what I'm trying to do is possible.
The idea behind the data:
One employee will be logged in and can perform multiple transactions. hits.eventInfo captures all of this data, but the only field that separates the transactions from one another is a flight_search field, which is fired to look up a person's records before a transaction (I also thought about using the resetting hitNumber as a transaction separator, but it's not always a clean reset per transaction).
My question is: is it possible to group by the fullVisitorId+VisitId, date, and this logic where all of the array_agg results reset each time the flight_search field is fired? Currently, all the transactional data is going into one array instead of separate arrays per transaction. It's then impossible to tell which fields go with which transaction. Further, taking the max is supposed to give me the last updates in each transaction, but it just gives me the last transaction because they are all together.
An example of my query is below. I have to use array_agg or something like it, since the subqueries can only have one return value.
WITH eventData AS (
SELECT
CONCAT(fullVisitorId, ' ', CAST(VisitId AS string)) sessionId,
date AS date,
hit.hour AS checkinHour,
hit.minute AS checkinMin,
(SELECT ARRAY_AGG(hit.eventInfo.eventAction) FROM UNNEST(hits) hit WHERE hit.eventInfo.eventCategory = 'pnr') AS pnr,
(SELECT ARRAY_AGG(STRUCT(hit.eventInfo.eventAction)) AS val FROM UNNEST(hits) hit WHERE hit.eventInfo.eventCategory = 'submit_checkin') AS names
FROM
`web-analytics.192016109.ga_sessions_20191223`,
UNNEST(hits) AS hit
## group by sessionId, date, hit.eventInfo.eventCategory ='flight_search'
)
SELECT
sessionId,
date,
MAX(checkinHour) chkHr,
MAX(checkinMin) AS chkMin,
# end of transaction
MAX(pnr[ORDINAL(ARRAY_LENGTH(pnr))]) AS pnr,
names.eventAction AS pax_name
FROM
eventData,
UNNEST (names) AS names
GROUP BY
sessionId,
date,
pax_name
Technically, if I add a group by here, everything will break because I'll then be asked to also group by hour, minute and then hits, which is an array...
Example test data
This is the original eventData as it is fed from Google Analytics into BigQuery. I have simplified the displayed eventCategories. This is where the inner query sources from. A transaction is completed after the submit_checkin event happens. As we can see, though, there is one pnr (identifier) but multiple people are checked in for that pnr.
This is a sample of what the output from eventData looks like. As you can see, the pnrs are grouped into one array and the names into another. It's not directly possible to see which were together in which transaction.
Lastly, here is the whole query output. I wrote on the picture what the expected result is.
If you want to see which pieces of information were tracked in the same hit, you should keep the relation between them. But it seems they are not in the same hit, with eventCategory being 'pnr' one time and 'submit_checkin' the other.
I'm not sure it's intentional but you're also cross joining the table with hits ... and then you're array_agg()-ing the hits array per hit again. That seems wrong.
If you're staying on session scope then there is no need to group anything, because the table already comes with 1 row = 1 session.
This query prepares the hits and uses a window function to mark where a new transaction starts:
SELECT
  fullVisitorId,
  visitstarttime,
  date,
  ARRAY(
    SELECT AS STRUCT
      hitNumber,
      IF(eventInfo.eventCategory = 'flight_search'
         AND LAG(eventInfo.eventCategory) OVER (ORDER BY hitNumber ASC) = 'submit_checkin',
         1, 0) AS breakInfo,
      eventInfo,
      hour,
      minute
    FROM UNNEST(hits) hit
    WHERE hit.eventInfo.eventCategory IN ('pnr', 'submit_checkin', 'flight_search')
    ORDER BY hitNumber ASC
  ) AS myhits1,
  ARRAY(
    SELECT AS STRUCT
      *,
      SUM(breakInfo) OVER (ORDER BY hitNumber) AS arrayId
    FROM (
      SELECT
        hitNumber,
        IF(eventInfo.eventCategory = 'flight_search'
           AND LAG(eventInfo.eventCategory) OVER (ORDER BY hitNumber ASC) = 'submit_checkin',
           1, 0) AS breakInfo,
        eventInfo,
        hour,
        minute
      FROM UNNEST(hits) hit
      WHERE hit.eventInfo.eventCategory IN ('pnr', 'submit_checkin', 'flight_search')
      ORDER BY hitNumber ASC
    )
  ) AS myhits2
FROM
  `web-analytics.192016109.ga_sessions_20191223`
This gives you a number (arrayId) per transaction to group by. You then only need to feed the rows that go into the second array function to yet another sub-query that finally groups them into arrays using array_agg() and group by arrayId.
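For illustration, a hedged sketch of that final step. Here prepared_sessions stands for the output of the query above (e.g. saved as a view or wrapped in a WITH clause), and the choice of aggregated columns is just an example:
SELECT
  fullVisitorId,
  visitstarttime,
  date,
  h.arrayId AS arrayId,
  -- one array of pnr values and one array of checked-in names per transaction
  ARRAY_AGG(IF(h.eventInfo.eventCategory = 'pnr',
               h.eventInfo.eventAction, NULL) IGNORE NULLS) AS pnr,
  ARRAY_AGG(IF(h.eventInfo.eventCategory = 'submit_checkin',
               h.eventInfo.eventAction, NULL) IGNORE NULLS) AS pax_names
FROM prepared_sessions, UNNEST(myhits2) AS h
GROUP BY fullVisitorId, visitstarttime, date, arrayId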

Cassandra and cleaning old data with counter type

So I know that TTL is not available for counters because of design reasons, and I've read https://issues.apache.org/jira/browse/CASSANDRA-2103 as well as some other SO questions regarding this, but there seems to be no clear answer (unless I am missing something, which is entirely plausible):
How do we elegantly handle the expiration of counters in Cassandra?
Example use case: page views on a specific day.
For this we might have a table such as
CREATE TABLE pageviews (page varchar, date varchar, views counter, PRIMARY KEY(page, date));
One year from now, the information about how many views we had on one specific day is not very relevant (instead we might have aggregated it into a views/month table or similar), and we don't want unnecessary data hanging around in our db for no reason. Normally we would put a TTL on this and let Cassandra handle it for us - elegant! But since we aren't allowed to use TTL for counter tables, this is not an option.
You also can't just run delete from pageviews where date > 'xxxx', since both key columns must be defined in the where clause.
You would first need to query for all the pages and then issue individual deletes, which is not scalable.
Is there any proper way of achieving this ?
It's significantly slower, but that's kind of the price if you don't want to manage the expiration yourself: you can use LWTs and actually insert TTL'd columns instead of updating a counter, i.e.:
CREATE TABLE pageviews (
page varchar,
date timestamp,
views int,
PRIMARY KEY(page, date))
WITH compaction = {'class': 'LeveledCompactionStrategy'};
To update a page view:
UPDATE pageviews USING TTL 604800
SET views = 12
WHERE page = '/home' AND date = 'YYYY-MM-DD'
IF views = 11;
If it fails, re-read and try again. This can be very slow under high contention, but in that case you can do some batching per app instance, say only flush updates every 10 seconds or so and increment by more than 1 at a time.
To see total in range of dates:
SELECT sum(views) FROM pageviews WHERE page='/home' and date >= '2017-01-01 00:00:00+0200' AND date <= '2017-01-13 23:59:00+0200'
The fastest approach would be to use counters and just have a job, run during a less busy time, that deletes anything older than X days.
Another idea, if you are OK with some percentage of error: you can use a single counter per page and use forward decay to "expire" (make insignificant) old view increments, though you will still need a job to adjust the landmark periodically. This is not as useful for looking at ranges and will only give you an estimate of the "total so far".
If you don't need date range queries, you can use a composite partition key of (page bucket, date), where the bucket is something like hash(page) % X, and a clustering key of page.
Then, for each date you wish to discard, you can delete the partitions for buckets 0 through X - 1 with X delete statements.
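A minimal CQL sketch of that layout, assuming X = 10 buckets and that the application computes the bucket (for example from a hash of the page); the names are illustrative:
CREATE TABLE pageviews_bucketed (
    bucket int,      -- e.g. abs(hash(page)) % 10, computed by the application
    date   varchar,
    page   varchar,
    views  counter,
    PRIMARY KEY ((bucket, date), page)
);

-- Increment as usual:
UPDATE pageviews_bucketed SET views = views + 1
WHERE bucket = 3 AND date = '2017-01-01' AND page = '/home';

-- To discard a day, issue X partition-level deletes (one per bucket):
DELETE FROM pageviews_bucketed WHERE bucket = 0 AND date = '2017-01-01';
-- ... repeat for buckets 1 through 9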

Volume of an Incident Queue at a Point in Time

I have an incident queue consisting of a record number (string), an open time (datetime) and a close time (datetime). The records go back a year or so. What I am trying to get is a line graph displaying the queue volume as it was at 8 PM each day. So if a ticket was opened before 8 PM on that day, or at any time on a previous day, but not closed as of 8 PM, it should be included in the population.
I tried the below, but this won't work because it doesn't really take into account multiple days.
If DATEPART('hour',[CloseTimeActual])>18 AND DATEPART('minute',[CloseTimeActual])>=0 AND DATEPART('hour',[OpenTimeActual])<=18 THEN 1
ELSE 0
END
Has anyone dealt with this problem before? I am using Tableau 8.2, cannot use 9 yet due to company license so please only propose 8.2 solutions. Thanks in advance.
For tracking history of state changes, the easiest approach is to reshape your data so each row represents a change in an incident state. So there would be a row representing the creation of each incident, and a row representing each other state change, say assignment, resolution, cancellation etc. You probably want columns to represent an incident number, date of the state change and type of state change.
Then you can write a calculated field that returns +1, -1 or 0 to express how the state change affects the number of currently open incidents. Then you use a running total to see the total number open at a given time.
You may need to show missing date values or add padding if state changes are rare. For other analytical questions, structuring your data with one record per incident may be more convenient. To avoid duplication, you might want to use database views or custom SQL with UNION ALL clauses to allow both views of the same underlying database tables.
It's always a good idea to be able to fill in the blank for "Each record in my dataset represents exactly one _________"
Tableau 9 has some reshaping capability in the data connection pane, or you can preprocess the data or create a view in the database to reshape it. Alternatively, you can specify a Union in Tableau with some calculated fields (or similarly custom SQL with a UNION ALL clause). Here is a brief illustration:
select open_date as Date,
       'OPEN' as Action,
       1 as Queue_Change,
       <other columns if desired>
from incidents
UNION ALL
select close_date as Date,
       'CLOSE' as Action,
       -1 as Queue_Change,
       <other columns if desired>
from incidents
where close_date is not null
Now you can use a running sum of SUM(Queue_Change) to see the number of open incidents over time. If you have other columns like priority, department, type etc., you can filter and group as usual in Tableau. This data source can be in addition to your previous one; you don't have to have a single view of the data for every worksheet in your workbook. Sometimes you want a few different connections to the same data at different levels of detail or from different perspectives.
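If you would rather compute the running count in the database (for example via custom SQL) than as a Tableau table calculation, a window function over the reshaped rows gives the same numbers. A minimal sketch, assuming your database supports window functions and that reshaped_incidents stands for the UNION ALL query above:
select Date,
       sum(daily_change) over (order by Date rows unbounded preceding) as open_incidents
from (
    -- net change in open incidents per day, from the reshaped rows
    select Date, sum(Queue_Change) as daily_change
    from reshaped_incidents
    group by Date
) daily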