How to Construct Nested JSON Message on Output Topic in KSQLDB - apache-kafka

From one of the source systems I received the event payload below, and created STREAM1 over the following JSON payload.
Event JSON 1
{
  "event": {
    "header": {
      "name": "abc",
      "version": "1.0",
      "producer": "123",
      "channel": "lab",
      "countryCode": "US"
    },
    "body": {
      "customerIdentifiers": [
        {"customerIdentifier": "1234", "customerIdType": "cc"},
        {"customerIdentifier": "234", "customerIdType": "id"}
      ],
      "accountIdentifiers": [
        {"accountIdentifier": "123", "accountIdType": "no"},
        {"accountIdentifier": "Primary", "accountIdType": "da"}
      ],
      "eventDetails": {
        "offeramount": "40000",
        "apr": "2.6%",
        "minpayment": "400",
        "status": "Approved"
      }
    }
  }
}
Event JSON 2
{
  "event": {
    "header": {
      "name": "abc",
      "version": "1.0",
      "producer": "123",
      "channel": "lab",
      "countryCode": "US"
    },
    "body": {
      "customerIdentifiers": [
        {"customerIdentifier": "1234", "customerIdType": "cc"},
        {"customerIdentifier": "234", "customerIdType": "id"}
      ],
      "accountIdentifiers": [
        {"accountIdentifier": "123", "accountIdType": "no"},
        {"accountIdentifier": "Primary", "accountIdType": "da"}
      ],
      "eventDetails": {
        "offeramount": "70000",
        "apr": "3.6%",
        "minpayment": "600",
        "status": "Rejected"
      }
    }
  }
}
I have created an aggregation table on the above STREAM1:
CREATE TABLE EVENT_TABLE AS
SELECT
avg(minpayment) as Avg_MinPayment,
avg(apr) AS Avg_APr,
avg(offeramount) AS Avgofferamount ,
status
FROM STREAM1
GROUP BY status
EMIT CHANGES;
Status   | Avg_MinPayment | Avg_APr | Avgofferamount
---------+----------------+---------+---------------
Approved | 400            | 2.6%    | 40000
Rejected | 600            | 3.6%    | 70000
I got the above result from the KTable, and the KTable topic's JSON looks like this:
Aggregate JSON1
PRINT 'EVENT_TABLE';
{
"Status" : "Approved",
"Avg_Minpayment" : "400",
"Avg_APr" : "2.6%",
"offeramount" : "40000"
}
Aggregate JSON2
{
"Status" : "Rejected",
"Avg_Minpayment" : "600",
"Avg_APr" : "3.6%",
"offeramount" : "70000"
}
But I have to construct and publish the final target JSON to the output topic in the format below. I have to add the header and body to Aggregate JSON1 and Aggregate JSON2.
{
  "event": {
    "header": {
      "name": "abc",
      "version": "1.0",
      "producer": "123",
      "channel": "lab",
      "countryCode": "US"
    },
    "body": {
      "Key": [
        {"Status": "approved", "Avg_Minpayment": "400", "Avg_APr": "2.6%", "offeramount": "40000"},
        {"Status": "rejected", "Avg_Minpayment": "600", "Avg_APr": "3.6%", "offeramount": "70000"}
      ]
    }
  }
}

It's not terribly clear what you're trying to achieve, given that your example SQL won't produce the example output from the example input. In fact, your example SQL would fail with unknown-column errors.
Something like the following would generate your example output:
CREATE TABLE EVENT_TABLE AS
SELECT
status,
avg(eventDetails->minpayment) as Avg_MinPayment,
avg(eventDetails->apr) AS Avg_APr,
avg(eventDetails->offeramount) AS Avgofferamount
FROM STREAM1
GROUP BY status
EMIT CHANGES;
Next, your example output...
Status   | Avg_MinPayment | Avg_APr | Avgofferamount
---------+----------------+---------+---------------
Approved | 400            | 2.6%    | 40000
Rejected | 600            | 3.6%    | 70000
...outputs one row per status. Yet the output you say you want to achieve...
{
  "event": {
    "header": {
      "name": "abc",
      "version": "1.0",
      "producer": "123",
      "channel": "lab",
      "countryCode": "US"
    },
    "body": {
      "Key": [
        {"Status": "approved", "Avg_Minpayment": "400", "Avg_APr": "2.6%", "offeramount": "40000"},
        {"Status": "rejected", "Avg_Minpayment": "600", "Avg_APr": "3.6%", "offeramount": "70000"}
      ]
    }
  }
}
...contains both statuses, i.e. it combines both of your example input messages into a single output.
If I'm understanding you correctly, and you do indeed want to output the above JSON, then:
You would first need to include the event information. But which event information? If you know it's always going to be the same, then you can use:
CREATE TABLE EVENT_TABLE AS
SELECT
status,
latest_by_offset(event) as event,
avg(eventDetails->minpayment) as Avg_MinPayment,
avg(eventDetails->apr) AS Avg_APr,
avg(eventDetails->offeramount) AS Avgofferamount
FROM STREAM1
GROUP BY status
EMIT CHANGES;
The latest_by_offset aggregate function will capture the event information from the last message it saw. Though I'm not convinced this is what you want. Could you not be getting other rejected and accepted messages with different event information? If it is the event information that identifies which messages should be grouped together, then something like this might give you something close to what you want:
CREATE TABLE EVENT_TABLE AS
SELECT
event,
collect_list(eventDetails) as body
FROM STREAM1
GROUP BY event
EMIT CHANGES;
If this is close, then you may want to use the STRUCT constructor and AS_VALUE function to restructure your output. For example:
CREATE TABLE EVENT_TABLE AS
SELECT
event as key,
AS_VALUE(event) as event,
STRUCT(
keys := collect_list(eventDetails)
) as body
FROM STREAM1
GROUP BY event
EMIT CHANGES;

Related

How to handle a nested array in Druid

My json is as below:
{
  "id": 11966121,
  "employer_id": 175,
  "account_attributes": [
    {
      "id": 155387028,
      "is_active": false,
      "created_at": "2018-06-06T02:12:25.243Z",
      "updated_at": "2021-03-15T17:38:04.598Z"
    },
    {
      "id": 155387062,
      "is_active": true,
      "created_at": "2018-06-06T02:12:25.243Z",
      "updated_at": "2021-03-15T17:38:04.598Z"
    }
  ],
  "created_at": "2017-12-13T18:31:04.000Z",
  "updated_at": "2021-03-14T23:50:43.180Z"
}
I want to parse the message and have a table with the account_attributes flattened.
Considering the sample payload, the output should have two rows:
id       | account_attributes_id | is_active | created_at               | updated_at
11966121 | 155387028             | false     | 2018-06-06T02:12:25.243Z | 2021-03-15T17:38:04.598Z
11966121 | 155387062             | true      | 2018-06-06T02:12:25.243Z | 2021-03-15T17:38:04.598Z
Is this possible?

How can I count users that generate events within a certain period of time with Kafka Streams?

I have streaming events which have a user_id in them. I want to count how many distinct users generate an event within a certain period of time. However, I am a beginner with Kafka and cannot work out how to solve this.
Example events within 1 minute:
{"event_name": "viewProduct", "user_id": "12"}
{"event_name": "viewProductDetails", "user_id": "23"}
{"event_name": "viewProductComments", "user_id": "12"}
{"event_name": "viewProduct", "user_id": "23"}
{"event_name": "viewProductComments", "user_id": "32"}
My code should report that there are 3 active users according to the events above.
My approach is as follows; however, this solution cannot eliminate multiple events from the same user, so it counts the same user multiple times.
builder.stream("orders") // read from the orders topic
    .mapValues(v -> { // extract user_id via the JSON parser
        JsonNode jsonNode = null;
        try {
            jsonNode = objectMapper.readTree((String) v);
            return jsonNode.get("user_id").asText();
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
        return "";
    })
    .selectKey((k, v) -> "1") // put the same key on every record
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(1))) // use time windows
    .count() // count values
I might be missing something here, but why don't you just do:
.selectKey((k, v) -> v)
That will group the record by value, which you previously populated with user_id.
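For completeness, here is a minimal, self-contained sketch of that idea. Only the topic name orders and the JSON shape come from the question; the application id, bootstrap server, String serdes, 1-minute window, and Jackson usage are assumptions for illustration. Keying by user_id means each distinct user appears at most once per window as a key, so counting the resulting keys per window gives the number of active users:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

import java.time.Duration;
import java.util.Properties;

public class ActiveUsersPerWindow {

    public static void main(String[] args) {
        ObjectMapper objectMapper = new ObjectMapper();
        StreamsBuilder builder = new StreamsBuilder();

        // Re-keying by user_id means count() produces one row per distinct
        // user per window, so the number of keys emitted for a window is the
        // number of active users in that window.
        KTable<Windowed<String>, Long> eventsPerUser = builder
                .stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                .mapValues(v -> {
                    try {
                        // extract user_id from the JSON event
                        JsonNode jsonNode = objectMapper.readTree(v);
                        return jsonNode.get("user_id").asText();
                    } catch (Exception e) {
                        return ""; // skip records that are not valid JSON
                    }
                })
                .filter((k, userId) -> !userId.isEmpty())
                .selectKey((k, userId) -> userId)           // key by user_id instead of a constant
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
                .count();

        // print each (window, user) count; one line per active user per window
        eventsPerUser.toStream()
                .foreach((windowedUserId, count) ->
                        System.out.println(windowedUserId + " -> " + count));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "active-users-demo");   // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // hypothetical
        new KafkaStreams(builder.build(), props).start();
    }
}
If you need a single "3 active users" number per window rather than one row per user, you would still aggregate these per-user keys downstream, but the re-keying above is the part that stops the same user being counted twice.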

Cross-venue visitor reporting approach in Location Based Service system

I'm looking for an approach to build a cross-venue visitor report for my client. He wants an HTTP API that returns the total unique count of his customers who have visited more than one shop in a given day range (the API must respond within 1-2 seconds).
The raw data sample (millions of records in reality):
--------------------------
DAY | CUSTOMER | VENUE
--------------------------
1 | cust_1 | A
2 | cust_2 | A
3 | cust_1 | B
3 | cust_2 | A
4 | cust_1 | C
5 | cust_3 | C
6 | cust_3 | A
Now, I want to calculate the cross-visitor report. IMO the steps would be as follows:
Step 1: aggregate raw data from day 1 to 6
--------------------------
CUSTOMER | VENUE VISIT
--------------------------
cus_1 | [A, B, C]
cus_2 | [A]
cus_3 | [A, C]
Step 2: produce the final result
Total unique cross-customer: 2 (cus_1 and cus_3)
I've tried some solutions:
I first used MongoDB to store the data, then used Flask to write an API based on MongoDB's utilities: aggregation, addToSet, group, count... but the API's response time was unacceptable.
Then I switched to Elasticsearch, hoping its aggregation command sets would help, but they do not support a pipeline group command on the output of the first "terms" aggregation.
After that, I read about Redis Sets, Sorted Sets, ... but they couldn't help either.
Could you please give me a clue to solving my problem?
Thanks in advance!
You can easily do this with Elasticsearch by leveraging one date_histogram aggregation to bucket by day, two terms aggregations (first bucketing by customer and then by venue), and then selecting only the customers who visited more than one venue on any given day using the bucket_selector pipeline aggregation. It looks like this:
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggs": {
        "customers": {
          "terms": {
            "field": "customer.keyword"
          },
          "aggs": {
            "venues": {
              "terms": {
                "field": "venue.keyword"
              }
            },
            "cross_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "venues_count": "venues._bucket_count"
                },
                "script": {
                  "source": "params.venues_count > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
In the result set, you'll get customers 1 and 3 as expected.
UPDATE:
Another approach involves using a scripted_metric aggregation in order to implement the logic yourself. It's a bit more complicated and might not perform well depending on the number of documents and hardware you have, but the following algorithm would yield the response 2 exactly as you expect:
POST sales/_search
{
  "size": 0,
  "aggs": {
    "unique": {
      "scripted_metric": {
        "init_script": "params._agg.visits = new HashMap()",
        "map_script": "def cust = doc['customer.keyword'].value; def venue = doc['venue.keyword'].value; def venues = params._agg.visits.get(cust); if (venues == null) { venues = new HashSet(); } venues.add(venue); params._agg.visits.put(cust, venues)",
        "combine_script": "def merged = new HashMap(); for (v in params._agg.visits.entrySet()) { def cust = merged.get(v.key); if (cust == null) { merged.put(v.key, v.value) } else { cust.addAll(v.value); } } return merged",
        "reduce_script": "def merged = new HashMap(); for (agg in params._aggs) { for (v in agg.entrySet()) { def cust = merged.get(v.key); if (cust == null) { merged.put(v.key, v.value) } else { cust.addAll(v.value); } } } def unique = 0; for (m in merged.entrySet()) { if (m.value.size() > 1) unique++; } return unique"
      }
    }
  }
}
Response:
{
  "took": 1413,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "unique": {
      "value": 2
    }
  }
}

REST: Single representation of responses for two REST requests that return object/objects of type 'A'

I have two REST URLs:
{host}/data/account
which returns a collection of accounts. We can model the response for this request as a collection of accounts, as shown below:
{
"items": [{
"prop_name_1": "val_1",
"prop_name_2": "val_2",
|
|
|
|
"prop_name_n": "val_n",
"Links": [{child_links}]
}],
"Links": [{
{self_link},
{pagination_links}
}]
}
{host}/data/account/{particular_account_id}
which returns only a single account. We can model the response for this request as a single account, as shown below:
{
"prop_name_1": "val_1",
"prop_name_2": "val_2",
|
|
|
|
"prop_name_n": "val_n",
"Links": [{child_links}]
}
Now the question is: can I model the responses for both of these requests as a collection of accounts only, instead of two different representations?
The reason for doing this is simplicity of parsing the response and consistency of representation. Also, a collection can be used to represent a single object as well. Am I correct?
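For illustration only (this is not from the original post), a single account returned in the unified collection shape would simply be a one-element items array, reusing the placeholder property names and link placeholders from the examples above:
{
  "items": [{
    "prop_name_1": "val_1",
    "prop_name_2": "val_2",
    "prop_name_n": "val_n",
    "Links": [{child_links}]
  }],
  "Links": [{
    {self_link}
  }]
}
Whether the pagination links are kept, emptied, or dropped for a single-element collection is then a design choice to document.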

Group By / Sum Aggregate Query with Parse Cloud Code

I have an Inventory table in my Parse database with two relevant fields: productId and quantity. When a shipment is received, a record is created containing the productId and quantity. Similarly, when a sale occurs, an inventory record is made with the productId and quantity (which will be negative, since the inventory decreases after the sale).
I would like to run a group by/ sum aggregate query on the Inventory table with Parse Cloud Code that outputs a dictionary containing unique productIds as the keys and the sum of the quantity column for those Ids as the values.
I have seen a number of old posts saying that Parse does not do this, but more recent posts refer to Cloud Code such as averageStars in the Cloud Code Guide: https://parse.com/docs/cloud_code_guide
However, it seems that the Parse.Query used in averageStars has a maximum limit of 1000 records. Thus, when I sum the quantity column, I am only doing so over 1000 records rather than the whole table. Is there a way that I can compute the group by/sum across all the records in the Inventory table?
For example:
Inventory Table
productId quantity
Record 1: AAAAA 50
Record 2: BBBBB 40
Record 3: AAAAA -5
Record 4: BBBBB -2
Record 5: AAAAA 10
Record 6: AAAAA -7
Output dictionary:
{AAAAA: 48, BBBBB: 38}
You can use Parse.Query.each(). It has no limit. If your class has too many entries it will time out, though.
See docs
e.g.:
var totalQuantity = 0;
var inventoryQuery = new Parse.Query("Inventory");
inventoryQuery.each(
  function(result) {
    totalQuantity += result.get("quantity");
  }, {
    success: function() {
      // looped through everything
    },
    error: function(error) {
      // error is an instance of Parse.Error.
    }
  });
If it times out, you have to build something like this.
In case you want to see the code with the dictionary:
Parse.Cloud.define("retrieveInventory", function(request, response) {
  var productDictionary = {};
  var query = new Parse.Query("Inventory");
  query.equalTo("personId", request.params.personId);
  query.each(
    function(result) {
      var num = result.get("quantity");
      if (result.get("productId") in productDictionary) {
        productDictionary[result.get("productId")] += num;
      } else {
        productDictionary[result.get("productId")] = num;
      }
    }, {
      success: function() {
        response.success(productDictionary);
      },
      error: function(error) {
        response.error("Query failed. Error = " + error.message);
      }
    });
});