Window That Keeps Aggregated Results In Memory - complex-event-processing

I'm using Siddhi v4.1.10 to create a long-running standalone Java app that will process data from multiple files, aggregate it, and send the results to a message broker. There will be roughly 90 GB of files per day with 130 million records (up to 100,000 events per minute). Each event created from a file record has ~30 fields.
Ex:
define stream FileStream (time double, user_ip string, status string, bytes int, http_method string, url string, duration int, ....);
The application should be able to run hundreds of queries that aggregate data into 1 min, 5 min, and 10 min batches based on the time taken from the log file.
Ex:
from FileStream#window.externalTimeBatch(time, 60 seconds)
select time, account_id, status_code, sum(bytes) as bytes, count(requests) as requests
group by account_id, status_code, time
insert into rabbtmqstream;
The problem is that the batch window keeps events in memory until the time is up, but I need a system that can aggregate events over time periods without retaining any events in memory. Is that possible with Siddhi?
I thought about OutputRateLimiter but it's not possible to extend it.
Please advise.

Use a Siddhi 4.x version. It will not store the events in memory when only aggregations are done via timeBatch windows; in that case it keeps only the running aggregate values in memory.
Alternatively, you can also use aggregations to aggregate values using both in-memory state and a database.
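For illustration, here is a minimal sketch of embedding such a windowed aggregation in a standalone Java app, assuming the Siddhi 4.x org.wso2.siddhi packages; the stream definition is simplified to a few fields, the stream and query names are made up, and the callback just prints instead of publishing to RabbitMQ.
import org.wso2.siddhi.core.SiddhiAppRuntime;
import org.wso2.siddhi.core.SiddhiManager;
import org.wso2.siddhi.core.event.Event;
import org.wso2.siddhi.core.stream.input.InputHandler;
import org.wso2.siddhi.core.stream.output.StreamCallback;

public class FileAggregationApp {
    public static void main(String[] args) throws InterruptedException {
        // Simplified stream and query; field names are illustrative, not the full ~30-field schema.
        // Note the timestamp is declared as long so it can drive externalTimeBatch.
        String siddhiApp =
            "define stream FileStream (time long, account_id string, status string, bytes long); " +
            "@info(name = 'perMinute') " +
            "from FileStream#window.externalTimeBatch(time, 1 min) " +
            "select account_id, status, sum(bytes) as bytes, count() as requests " +
            "group by account_id, status " +
            "insert into OutputStream;";

        SiddhiManager siddhiManager = new SiddhiManager();
        SiddhiAppRuntime runtime = siddhiManager.createSiddhiAppRuntime(siddhiApp);

        // In the real app this callback would publish to RabbitMQ instead of printing.
        runtime.addCallback("OutputStream", new StreamCallback() {
            @Override
            public void receive(Event[] events) {
                for (Event e : events) {
                    System.out.println(e);
                }
            }
        });

        runtime.start();
        InputHandler input = runtime.getInputHandler("FileStream");
        // One event per parsed file record; the first attribute drives externalTimeBatch.
        input.send(new Object[]{1_000L, "acct-1", "200", 512L});
        input.send(new Object[]{61_000L, "acct-1", "200", 1024L}); // crosses the 1-minute boundary

        Thread.sleep(500);
        runtime.shutdown();
        siddhiManager.shutdown();
    }
}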

Related

Implementing API throttling with RDB

I would like to implement this API throttling:
A user can only execute the operation once per minute (once executed, subsequent requests will be rejected for 1 minute).
The expected total number of requests from all users is around 2 per second.
I am using PostgreSQL 14.5.
I guess I will need a table for exclusive processing. What kind of SQL/algorithm should I use?
You could store the latest accepted timestamp for each user in a column. Every time a request comes in, the code checks whether the interval between the current timestamp and the last accepted timestamp is less than a minute, rejects the request if so, and otherwise updates the stored timestamp.
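A minimal sketch of that check in Java/JDBC, assuming a hypothetical api_throttle(user_id, last_accepted) table; the conditional UPDATE makes the check-and-set atomic, so two concurrent requests cannot both be accepted.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ThrottleCheck {
    // Returns true if the request is accepted, false if it is rejected.
    // Assumes a table: CREATE TABLE api_throttle (user_id text PRIMARY KEY, last_accepted timestamptz);
    // A first-time user would need an INSERT ... ON CONFLICT upsert to create the row (omitted here).
    public static boolean tryAccept(Connection conn, String userId) throws SQLException {
        // The row is updated only if the previous acceptance is at least one minute old,
        // so the check and the update happen in a single atomic statement.
        String sql =
            "UPDATE api_throttle " +
            "   SET last_accepted = now() " +
            " WHERE user_id = ? " +
            "   AND last_accepted <= now() - interval '1 minute'";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, userId);
            return ps.executeUpdate() == 1; // 0 rows updated means the request is throttled
        }
    }
}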

How to control retention over aggregate state store and changelog topic

My use case is the following:
Orders flow into an activation system via a topic. I have to identify changes between records with the same key. I compare the existing value with the new value using the aggregate function and output an event that points out the type of change identified, e.g. a DueDate change.
The key is a randomly generated number and the number of unique keys is pretty much unbounded. The same key is reused when the ordering system pushes a revision to an existing order.
The code has been running for a couple of months in production, but the state store and changelog topic keep growing and there is a concern about space usage. I would like records to expire after 90 days in the state store. I read about ways to apply time-based retention to a state store, and it looks like windowing the aggregation is a way of achieving that.
I understand that windowed aggregations are only available for tumbling and hopping windows; sliding windows are available for join operations only.
A tumbling window wouldn't work in this case because I would have windows for days 0-90, 90-180, etc., and I wouldn't be able to identify an update on day 92 for a record that came in on day 89 (they wouldn't share the same window).
So the only other option is a hopping window.
TimeWindows timeWindow = TimeWindows.of(Duration.ofDays(90)).advanceBy(Duration.ofDays(1)).until(Duration.ofDays(1).toMillis());
The problem is that I'll have to persist and update 90 windows. When the stream starts, 90 windows will be created: 0-90, 1-91, 2-92, 3-93, etc. If I have a retention of 1 day on the windows, the 0-90 window will be cleaned up on day 91.
Now let's say I get an update on day 90. Correct me if I'm wrong, but my understanding is that I will have to update 90 windows, and my state store will be quite large by then because of all the duplicates. Maybe this is where I'm missing something: if a record is present in 90 windows, is it physically written to disk 90 times?
In the end all I need is to prevent my state store and changelog topic from growing indefinitely. 90 days of historical data is sufficient to support my use case.
Would there be a better way to approach this?
It might be simpler not to use the DSL but the Processor API with a windowed state store. A windowed state store is just a key-value store with expiration. Hence, you can use it much like a key-value store; you only provide an additional timestamp that will be used to expire the data eventually.
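A rough sketch of that approach, assuming String keys and values, made-up topic and store names (orders, order-changes, order-state), and a deliberately trivial change check in place of the real due-date comparison; the exact store and transform APIs vary a bit across Kafka Streams versions.
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class OrderChangeTopology {

    private static final Duration RETENTION = Duration.ofDays(90);

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // The windowed store is used purely as a key-value store with time-based expiry.
        builder.addStateStore(Stores.windowStoreBuilder(
                Stores.persistentWindowStore("order-state", RETENTION, RETENTION, false),
                Serdes.String(), Serdes.String()));

        builder.<String, String>stream("orders")
               .transform(() -> new DetectChange(), "order-state")
               .to("order-changes");

        return builder;
    }

    // Compares the new order revision against the last one seen for the same key.
    static class DetectChange implements Transformer<String, String, KeyValue<String, String>> {
        private ProcessorContext context;
        private WindowStore<String, String> store;

        @SuppressWarnings("unchecked")
        @Override
        public void init(ProcessorContext context) {
            this.context = context;
            this.store = (WindowStore<String, String>) context.getStateStore("order-state");
        }

        @Override
        public KeyValue<String, String> transform(String key, String newValue) {
            long now = context.timestamp();
            String previous = null;
            // Look back over the retention period for the latest revision of this key.
            try (WindowStoreIterator<String> it =
                     store.fetch(key, now - RETENTION.toMillis(), now)) {
                while (it.hasNext()) {
                    previous = it.next().value; // iteration is time-ordered, so the last entry wins
                }
            }
            // Remember the new revision; entries older than the retention are dropped by the store.
            store.put(key, newValue, now);

            // Very simplified change detection: the real logic would diff specific fields such as the due date.
            if (previous != null && !previous.equals(newValue)) {
                return KeyValue.pair(key, "CHANGED");
            }
            return null; // no downstream record
        }

        @Override
        public void close() {}
    }
}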

Initial load of Kafka stream data with windowed join

I am using a Windowed Join between two streams, let's say a 7 day window.
On the initial load, all records in the DB (via a Kafka Connect source connector) are loaded into the streams. It then seems that ALL records end up in the window state store for those first 7 days, because the producer/ingestion timestamps are all in current time rather than a field (like create_time) that might be in the message value.
Is there a recommended way to balance the initial load against the Windows of the join?
Well, the question is which records you want to join to each other, and what timestamp the source connector sets as the record timestamp (this might also depend on the topic configuration [log.]message.timestamp.type).
The join is executed based on whatever the TimestampExtractor returns. By default, that is the record timestamp. If you want to base the join on some other timestamp, a custom timestamp extractor is the way to go.
If you want to get processing time semantics, you may want to use the WallclockTimestampExtractor though.
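A minimal sketch of such a custom extractor, assuming the deserialized value is a Map carrying an illustrative create_time field in epoch milliseconds.
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Bases the join on a create_time field carried in the message value (assumed here to be
// deserialized into a Map) instead of the broker/producer record timestamp.
public class CreateTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof Map) {
            Object createTime = ((Map<?, ?>) value).get("create_time"); // illustrative field name
            if (createTime instanceof Long) {
                return (Long) createTime;
            }
        }
        // Fall back to the record timestamp if the payload has no usable create_time.
        return record.timestamp();
    }
}
The extractor would then be registered via the default.timestamp.extractor property (StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG).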

Need advice on storing time series data in aligned 10 minute batches per channel

I have time series data in Kafka. The schema is quite simple - the key is the channel name, and the values are Long/Double tuples of the timestamp and the value (in reality it's a custom Avro object but it boils down to this). They always come in correct chronological order.
The desired end result is data packaged in 10-minute batches, aligned at 10 minutes (i.e., 00:00 < t <= 00:10, 00:10 < t <= 00:20, ..., 23:50 < t <= 00:00). Each package is to contain only data from one channel.
My idea is to have two Spark Streaming jobs. The first one takes the data from the Kafka topics and dumps it to a table in a Cassandra database where the key is the timestamp and the channel name, and every time such an RDD hits a 10 minute boundary, this boundary is posted to another topic, alongside the channel whose boundary is hit.
The second job listens to this "boundary topic", and for every received 10 minute boundary, the data is pulled from Cassandra, some calculations like min, max, mean, stddev are done and the data and these results are packaged to a defined output directory. That way, each directory contains the data from one channel and one 10 minute window.
However, this looks a bit clunky and like a lot of extra work to me. Is this a feasible solution or are there any other more efficient tricks to it, like some custom windowing of the Kafka data?
I agree with your intuition that this solution is clunky. How about simply using the time windowing functionality built into the Streams DSL?
http://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing
The most natural output would be a new topic containing the windowed aggregations, but if you really need it written to a directory that should be possible with Kafka Connect.
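A rough sketch of that DSL approach (exact method names vary a bit across Kafka Streams versions), assuming String channel keys, Double values, and made-up topic names.
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TenMinuteBatches {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("channel-values", Consumed.with(Serdes.String(), Serdes.Double()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
               // Tumbling 10-minute windows keyed by channel name. Note: DSL windows are
               // [start, end), i.e. the lower bound is inclusive, unlike the 00:00 < t <= 00:10 spec.
               .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))
               // Kept simple: a running sum per channel and window. min/max/mean/stddev would
               // need aggregate() with a small accumulator class instead of a plain Double.
               .reduce(Double::sum, Materialized.with(Serdes.String(), Serdes.Double()))
               .toStream((windowedKey, value) ->
                       windowedKey.key() + "@" + windowedKey.window().start())
               .to("channel-10min-aggregates", Produced.with(Serdes.String(), Serdes.Double()));

        return builder;
    }
}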
I work with Flink stream processing, not Spark Streaming, but I guess the programming concepts of the two are alike. So, supposing the data is ordered chronologically and you want to aggregate it for every 10 minutes and do some processing on the aggregated data, I think the best approach is to use the streaming window functions. I suggest defining a function that maps every incoming record's timestamp to the last 10-minute boundary:
12:10:24 ----> 12:10:00
12:10:30 ----> 12:10:00
12:25:24 ----> 12:20:00
So you can create a keyed stream object like:
StreamObject<Long, Tuple<data>>
where the Long field is the mapped timestamp of every message. Then you can apply a window; you should check what kind of window is most appropriate for your case.
Point: Setting a key for the data stream will cause the window function to consider a logical window for every key.
In the most simple case, you should define a time window of 10 minutes and aggregate all data incoming on that period of time.
The other approach, if you know the rate at which data is generated and how many messages will arrive in a 10-minute period, is to use a count window. For example, a window with a count of 20 will listen to the stream, aggregate all messages with the same key into a logical window, and apply the window function as soon as the number of messages in the window reaches 20.
After the messages are aggregated in a window as desired, you can apply your processing logic using a reduce function or a similar operation.
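As a small illustration of the mapping suggested above, a plain Java helper that floors an epoch-millisecond timestamp to its 10-minute boundary could look like this.
import java.util.concurrent.TimeUnit;

public class WindowKey {
    private static final long TEN_MINUTES_MS = TimeUnit.MINUTES.toMillis(10);

    // Maps an epoch-millisecond timestamp to the start of its 10-minute window,
    // e.g. 12:10:24 -> 12:10:00 and 12:25:24 -> 12:20:00.
    public static long floorToTenMinutes(long epochMillis) {
        return epochMillis - (epochMillis % TEN_MINUTES_MS);
    }
}
Tumbling event-time windows in Flink (and Kafka Streams) are already aligned this way internally, so the explicit mapping is mainly useful if you want to key the stream by the window start yourself.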

Handling embargoed content scenario in MarkLogic

I have a MarkLogic 7 database into which several documents are inserted, and every document has its own created-on and released-on timestamps. For example, if a document is inserted into the database at 1400 hrs and its released-on value is 1700 hrs, then I need to POST this document to an external REST service at 1700 hrs.
I have tried the following options:
Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and a Scheduled Task is created to trigger at the timestamp read from released-on.
Following are the queries/ observations for this approach:
Since admin configuration manipulation APIs are not transactionally protected operations I need to force a lock on some URI in order to create Scheduled Tasks from within CPF action modules running in parallel.
For details read here
When I insert 1000 documents it takes around 20 minutes for the CPF action modules to trigger and create 1000 scheduled tasks based on the released-on value read from the inserted document.
How can I pass the URI of the document that triggered the CPF action module to the Scheduled Task that was created from within the CPF action module based on the released-on value read from the document?
Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and xdmp:sleep() is called with the milliseconds remaining between the current dateTime and the value of released-on in the document.
Following are the queries/ observations for this approach:
The Task Server threads on which the CPF action modules are triggered remain occupied and are not released while xdmp:sleep() is running, so at any time the CPF action modules are processing at most 16 documents and the others remain queued.
Is there any way to configure the sleeping thread to become inactive, let other queued action modules get triggered, and become active again once the sleep duration has elapsed?
Configure a multi-step CPF pipeline, as described here, in which the document keeps bouncing between two states until the released-on timestamp has arrived.
Following are the queries/ observations for this approach:
Even when only 30 documents were inserted, the CPU utilization was observed to be 100%.
In all of these attempts, a lot of system resources (CPU and RAM) are consumed even for as few as 1000 documents. I need an approach that can handle even 100K documents.
Please let me know if there are any improvements that can be made to the above approaches, or whether MarkLogic provides some other way to handle such scenarios efficiently.
Rather than CPF, you could set up a scheduled job that will run, say, every 10 minutes and look for documents that are ready to be published. That job would look for documents with released-on values between fn:current-dateTime() and the last time the job ran, which I would save in the database.
For each of those documents, you would spawn a task to POST the document, so that an error in one doesn't cause problems for the others. After looping through, save the current time in the database for the next time.
The 10-minute window can be as large or small as you like, depending on your tolerance for a little delay.
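The shape of that job, sketched in Java with a hypothetical Store interface standing in for the MarkLogic-specific pieces (reading and saving the last-run watermark, the released-on range query, and the REST POST); in practice the job would typically be a scheduled task running a server-side module.
import java.time.Instant;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Shape of the polling job described above. The methods on Store are hypothetical
// placeholders for the MarkLogic-specific queries and the external REST call.
public class ReleaseJob {

    interface Store {
        Instant loadLastRun();                                       // read the saved watermark
        List<String> findReleasedBetween(Instant from, Instant to);  // URIs with released-on in (from, to]
        void saveLastRun(Instant ts);                                // persist the new watermark
        void postDocument(String uri);                               // POST one document to the REST service
    }

    private final Store store;
    private final ExecutorService workers = Executors.newFixedThreadPool(8);

    public ReleaseJob(Store store) {
        this.store = store;
    }

    // Invoked by the scheduler, e.g. every 10 minutes.
    public void runOnce() {
        Instant lastRun = store.loadLastRun();
        Instant now = Instant.now();

        // One task per document so a failing POST does not block the others.
        for (String uri : store.findReleasedBetween(lastRun, now)) {
            workers.submit(() -> store.postDocument(uri));
        }

        // Save the watermark only after the batch has been handed off.
        store.saveLastRun(now);
    }
}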