KSQL Rolling Sum - apache-kafka

We have created a KSQL stream that has id, sale, and event as fields, where "event" can have two values, "sale" and "endSale". We used the command below to create the stream.
CREATE STREAM sales (id BIGINT, sale BIGINT, event VARCHAR)
WITH (KAFKA_TOPIC='raw_topic', VALUE_FORMAT='json');
We want to aggregate on the "sale" field, so we have used SUM(sale) to aggregate it. We used the command below to create a table.
CREATE TABLE sales_total WITH (VALUE_FORMAT='JSON', KEY_FORMAT='JSON') AS
SELECT ID, SUM(SALE) AS TOTAL_SALES
FROM SALES GROUP BY ID EMIT CHANGES;
Now, we want to keep the sum aggregating as long as "event = sale". When "event = endSale", we want to publish the value of SUM(SALE) AS TOTAL_SALES to a different topic and reset "SUM(SALE) AS TOTAL_SALES" to 0.
How can we achieve this scenario?
Is there a way to achieve it using a UDAF? Can we pass custom values to a UDAF's "aggregate" function?
According to this link:
An aggregate function that takes N input rows and returns one output value. During the function call, the state is retained for all input records, which enables aggregating results. When you implement a custom aggregate function, it's called a User-Defined Aggregate Function (UDAF).
How are the N input rows passed to the UDAF? Please share a link to an example.
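For reference, here is a minimal sketch of what such a UDAF could look like, assuming ksqlDB's io.confluent.ksql.function.udaf Java API. The name sum_until_end, the STRUCT parameter layout, and the reset-to-zero behaviour are illustrative assumptions, not a confirmed solution. Note that the N input rows are not passed in a single call: ksqlDB calls aggregate() once per input row and threads the running state through, and custom values can be passed by packing extra columns into a STRUCT argument. Publishing the total to a different topic at the moment of the endSale event would still need a separate query.

import io.confluent.ksql.function.udaf.Udaf;
import io.confluent.ksql.function.udaf.UdafDescription;
import io.confluent.ksql.function.udaf.UdafFactory;
import org.apache.kafka.connect.data.Struct;

@UdafDescription(name = "sum_until_end",
    description = "Sums SALE until EVENT = 'endSale', then resets to 0")
public final class SumUntilEndUdaf {

  @UdafFactory(description = "rolling sum that resets on endSale",
      paramSchema = "STRUCT<SALE BIGINT, EVENT VARCHAR>")
  public static Udaf<Struct, Long, Long> createSumUntilEnd() {
    return new Udaf<Struct, Long, Long>() {
      @Override
      public Long initialize() {
        return 0L; // starting state for every key
      }

      @Override
      public Long aggregate(final Struct current, final Long aggregate) {
        // Called once per input row; `aggregate` carries the running total
        if ("endSale".equals(current.getString("EVENT"))) {
          return 0L; // reset the running total on the closing event
        }
        return aggregate + current.getInt64("SALE");
      }

      @Override
      public Long merge(final Long aggOne, final Long aggTwo) {
        return aggOne + aggTwo; // combine partial aggregates
      }

      @Override
      public Long map(final Long agg) {
        return agg; // the state is already the output value
      }
    };
  }
}

It would then be invoked in the GROUP BY query as SUM_UNTIL_END(STRUCT(SALE := SALE, EVENT := EVENT)), again under the assumptions above.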

Related

Create Table without data aggregation

I just started using Confluent's ksqlDB, and it stood out that it is not possible to run the following command: CREATE TABLE AS SELECT A, B, C FROM [STREAM_A] [EMIT CHANGES];
I wonder why this is not possible, or whether there's a way of doing it.
Requiring data aggregation here feels like a heavy process for a simple need.
Edit 1: Source is a STREAM and not a TABLE.
The field types are:
String
Integers
Record
Here is an example of an executed command that returns an error:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
[EMIT CHANGES];
The goal is to create a table with a smaller dataset and fewer fields than the original topic.
I think the confusion here is that you talk about a TABLE, but you're actually creating a STREAM. The two are different types of object.
A STREAM is an unbounded series of events - just like a Kafka topic. The only difference is that a STREAM has a declared schema.
A TABLE is state, for a given key. It's the same as a KTable in Kafka Streams, if you're familiar with that.
Both are backed by Kafka topics.
So you can do this - note that it's creating a STREAM, not a TABLE:
CREATE STREAM test_stream
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL;
If you really want to create a TABLE, then use the LATEST_BY_OFFSET aggregation, assuming you're using id as your key:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, LATEST_BY_OFFSET(timestamp)
, LATEST_BY_OFFSET(servicename)
, LATEST_BY_OFFSET(content->assignedcontent)
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
GROUP BY id;

Is the DISTINCT function deterministic? T-SQL

I have a table like below. For a distinct combination of UserID and ProductID, will SQL select the product bought from StoreID 1 or 2? Is it deterministic?
My code
SELECT (DISTINCT CONCAT(UserID, ProductID)), Date, StoreID FROM X
This isn't valid syntax. You can have
select [column_list] from X
or you can have
select distinct [column_list] from X
The difference is that the first will return one row for every row in the table while the second will return one row for every unique combination of the column values in your column list.
Adding "distinct" to a statement will reliably produce the same results every time unless the underlying data changes, so in this sense, "distinct" is deterministic. However, it is not a function so the term "deterministic" doesn't really apply.
You may actually want a "group by" clause like the following (in which case you have to actually specify how you want the engine to pick values for columns not in your group):
select
concat(UserId, ProductID)
, min(Date)
, max(StoreID)
from
x
group by
concat(UserId, ProductID)

Rolling up events to a minute in Apache Beam

So, I have a stream of data with this structure (I apologize it's in SQL)
CREATE TABLE github_events
(
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
user_id bigint,
org jsonb,
created_at timestamp
);
In SQL, I would roll this data up to the minute like this:
1. Create a roll-up table for this purpose:
CREATE TABLE github_events_rollup_minute
(
created_at timestamp,
event_count bigint
);
2. And populate it with INSERT/SELECT:
INSERT INTO github_events_rollup_minute(
created_at,
event_count
)
SELECT
date_trunc('minute', created_at) AS created_at,
COUNT(*) AS event_count
FROM github_events
GROUP BY 1;
In Apache Beam, I am trying to roll up events to the minute, i.e. count the total number of events received in each minute according to the event's timestamp field.
Timestamp (in YYYY-MM-DDThh:mm): event_count
So, if later in the pipeline we receive more events with the same overlapping timestamp (due to delivery delays, as the customer might be offline), we just need to take the roll-up count and increment it for that timestamp.
This will allow us to simply increment the count for YYYY-MM-DDThh:mm by event_count in the application.
We assume events might be delayed, but they'll always have the timestamp field.
I would like to accomplish the same thing in Apache Beam. I am very new to it and feel that I am missing something that would allow me to accomplish this, even though I've read the Apache Beam Programming Guide multiple times.
Take a look at the sections on Windowing and Triggers. What you're describing is fixed-time windows with allowed late data. The general shape of the pipeline sounds like:
Read input github_events data
Window into fixed windows of 1 minute, allowing late data
Count events per-window
Output the result to github_events_rollup_minute
The WindowedWordCount example project demonstrates this pattern.
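A minimal sketch of that shape using the Beam Java SDK follows. The GithubEvent type and its getCreatedAt() epoch-millis accessor are assumptions for illustration, as is the one-hour allowed lateness; accumulating fired panes means each late arrival re-fires the window with an updated cumulative count, matching the "increment the count" behaviour described above.

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class MinuteRollup {

  // events: the decoded github_events records (GithubEvent is assumed)
  public static PCollection<Long> rollupPerMinute(PCollection<GithubEvent> events) {
    return events
        // Stamp each element with the event-time from its created_at field
        .apply(WithTimestamps.of((GithubEvent e) -> new Instant(e.getCreatedAt())))
        // Fixed one-minute windows that accept late data; every late element
        // re-fires the pane so downstream consumers see an updated count
        .apply(Window.<GithubEvent>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardHours(1))
            .accumulatingFiredPanes())
        // Count the events in each window; withoutDefaults() is required
        // when combining globally over non-global windows
        .apply(Combine.globally(Count.<GithubEvent>combineFn()).withoutDefaults());
  }
}

Each firing is the cumulative event_count for one minute of event time; a final step (not shown) would pair each count with its window's start timestamp and write it to github_events_rollup_minute.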

Cassandra group by and filter results

I'm trying to mimic something like this:
Given a table test:
CREATE TABLE myspace.test (
item_id text,
sub_id text,
quantity bigint,
status text,
PRIMARY KEY (item_id, sub_id)
);
In SQL, we could do:
select * from (select item_id, sum(quantity) as quan
from test where status <> 'somevalue') sub
where sub.quan >= 10;
i.e. group by item_id and then filter out the results with less than 10.
Cassandra is not designed for this kind of thing, though I could mimic GROUP BY using user-defined aggregate functions:
CREATE FUNCTION group_sum_state
(state map<text, bigint>, item_id text, val bigint)
CALLED ON NULL INPUT
RETURNS map<text, bigint>
LANGUAGE java
AS $$
    Long current = (Long) state.get(item_id);
    if (current == null) current = 0L;
    state.put(item_id, current + val);
    return state;
$$;
CREATE AGGREGATE group_sum(text, bigint)
SFUNC group_sum_state
STYPE map<text, bigint>
INITCOND { }
And use it as a GROUP BY (this is probably going to have very bad performance, but still):
cqlsh:myspace> select group_sum(item_id, quantity) from test;
myspace.group_sum(item_id, quantity)
-------------------------------------------
{'123': 33, '456': 14, '789': 15}
But it seems to be impossible to filter by map values, either with a final function for the aggregate or with a separate function. I could define a function like this:
CREATE FUNCTION myspace.filter_group_sum
(group map<text, bigint>, vallimit bigint)
CALLED ON NULL INPUT
RETURNS map<text, bigint>
LANGUAGE java
AS $$
java.util.Iterator<java.util.Map.Entry<String, Long>> entries =
group.entrySet().iterator();
while(entries.hasNext()) {
Long val = entries.next().getValue();
if (val < vallimit)
entries.remove();
}
return group;
$$;
But there is no way to call it and pass a constant:
select filter_group_sum(group_sum(item_id, quantity), 15) from test;
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query]
message="line 1:54 no viable alternative at input '15'
(...(group_sum(item_id, quantity), [15]...)">
It complains about the constant 15.
Sorry for the long post; I needed to provide all the details to explain what I need.
So my questions are:
Is there a way to pass a constant to a user-defined function in Cassandra? And what alternatives do I have to implement a filtered GROUP BY?
A more general question: what is the proper data design in Cassandra to cover such a use case in a real-time query-serving application? Say I have a web app that takes the limit from the UI and needs to return all the items whose total quantity is larger than the given limit. The tables are going to be quite large, on the order of 10 billion records.
Vanilla Cassandra is a poor choice for ad hoc queries. DataStax Enterprise has added some of this functionality via integrations with Spark and Solr. The Spark integration is also open source, but you wouldn't want to do this for low-latency queries. If you need real-time queries, you're going to have to aggregate outside of Cassandra (in Spark or Storm, for example), then write back the aggregates to be consumed by your app. You can also look at Stratio's Lucene integration, which might help you for some of your queries.
I ran across your question when looking for information on passing a constant to a user-defined function.
The closest I can get to passing a constant is to pass a static column for the parameter for which you want to pass a constant. So if you update the static column before using the UDF, then you can pass that column. This will only work if you have a single client running such a query at a time, since the static column is visible to all clients. See this answer for an example:
Passing a constant to a UDF

Grouping in SSRS 2008 - Filtering

I'm new to SSRS. I want to group a transactions table by customerid and count how many transactions per customerid. I was able to do that.
But then I want to sort by that count, and/or filter by that count. How do you do that?
Thanks!
To set up sorting and filtering on row groups, right click on the row group.
You can access the Group Sorting and Filtering properties here. They should both allow you to set up rules based on the name of your count column.
Option 1
If you have no need to show the transactions in the report then the aggregation should be performed at the database level in the query, not by SSRS. You'll get the benefits of:
Faster rendering.
You'll be sending less data over the network.
There'll be less data for the SSRS engine to process, therefore any ordering can be performed quicker.
Your data set can be 'pre-ordered' by putting the most common/expected values in the ORDER BY clause of the underlying query, giving any rendering a speed boost as well.
Any filters can be applied directly against the aggregated data returned by the query without having to try and do complex expressions in SSRS.
This will also give a performance boost when rendering.
You could have a "filter" parameter that is used in the HAVING clause of an aggregate query.
Again, a performance boost due to less data across the network and less to be processed.
This gives a level of interactivity to your reports, as opposed to trying to pre-define user tastes and set filter conditions on expressions or a 'best guess'.
Example
-- Will filter out any customers who have 2 or fewer transactions
DECLARE @Filter AS int = 2
;
SELECT
CustomerId
,COUNT(TransactionId) AS TransactionCount
FROM
Transactions
GROUP BY
CustomerId
HAVING
COUNT(TransactionId) > @Filter
Option 2
If you still need to show the transactions, then add an additional column to your query that performs the Count() using the OVER clause and PARTITION BY customerid, like so:
COUNT(transactions) OVER (PARTITION BY customerid) AS CustomerTransactionCount
Assuming a very simple table structure you'll end up with a query structure like so:
SELECT
CustomerId
,TransactionId
,TransactionAttribute_1
,TransactionAttribute_2
,TransactionAttribute_3
.
.
.
,TransactionAttribute_n
,COUNT(TransactionId) OVER (PARTITION BY CustomerId) AS CustomerTransactionCount
FROM
Transactions
You'll be able to use CustomerTransactionCount as a filter and sorting column in any row/column groups within SSRS.
Drawback of this approach
Window functions, i.e. those using OVER (PARTITION BY ...), can only appear in the SELECT and ORDER BY clauses, so the aggregate column cannot be referenced in a WHERE or HAVING clause at the same query level. This means any filtering would otherwise have to be carried out by SSRS.
Workaround options
We take the query above and wrap a CTE around it. This will allow us to filter based on the aggregate results.
Put the aggregate in a derived table.
CTE Example
--Filter variable
DECLARE @Filter AS int = 2
;
WITH DataSet_CTE AS
(
-- Build the data set with transaction information and the aggregate column
SELECT
CustomerId
,TransactionId
,TransactionAttribute_1
,TransactionAttribute_2
,TransactionAttribute_3
.
.
.
,TransactionAttribute_n
,COUNT(TransactionId) OVER (PARTITION BY CustomerId) AS CustomerTransactionCount
FROM
Transactions
)
-- Filter and return data
SELECT *
FROM DataSet_CTE
WHERE CustomerTransactionCount > @Filter
Derived Table Example
--Filter variable
DECLARE @Filter AS int = 2
;
SELECT
*
FROM
(
-- Build the data set with transaction information and the aggregate column
SELECT
CustomerId
,TransactionId
,TransactionAttribute_1
,TransactionAttribute_2
,TransactionAttribute_3
.
.
.
,TransactionAttribute_n
,COUNT(TransactionId) OVER (PARTITION BY CustomerId) AS CustomerTransactionCount
FROM
Transactions
) AS DataSet
WHERE
DataSet.CustomerTransactionCount > @Filter