Keeping entries in a state store just for a defined time - apache-kafka

Problem: I need to find out who sent a message in the last e.g. 24 hours. I have the following stream and state store for lookups.
@SendTo(Bindings.MESSAGE_STORE)
@StreamListener(Bindings.MO)
public KStream<?, ?> groupBySender(KStream<String, Message> messages) {
    return messages.selectKey((key, message) -> message.from)
            .map((k, v) -> new KeyValue<>(k, v.sentAt.toString()))
            .groupByKey()
            .reduce((oldTimestamp, newTimestamp) -> newTimestamp,
                    Materialized.as(AggregatorApplication.MESSAGE_STORE))
            .toStream();
}
It works fine:
[
"key=123 value=2019-06-21T13:29:05.509Z",
"key=from value=2019-06-21T13:29:05.509Z",
]
so a lookup looks like:
store.get(from);
but I would like to remove entries older than 24 hours from the store automatically; currently they will be persisted, probably forever.
Is there a better way to do this? Maybe some windowing operation?

At the moment, KTables (which are basically key-value stores) don't support a TTL (cf. https://issues.apache.org/jira/browse/KAFKA-4212).
The current recommendation is to use a windowed store if you want to expire data. You might want to use a custom .transform() instead of windowedBy().reduce() to get more flexibility (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html).
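For illustration, here is a minimal sketch of the custom .transform() approach, assuming the Message class (with its from and sentAt fields) and AggregatorApplication.MESSAGE_STORE from the question; the hourly scan interval is an arbitrary choice. The Transformer keeps the latest timestamp per sender in the key-value store and uses a wall-clock punctuator to evict entries older than 24 hours:
import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class LastSeenTransformer implements Transformer<String, Message, KeyValue<String, String>> {

    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, String>) context.getStateStore(AggregatorApplication.MESSAGE_STORE);
        // Every hour (arbitrary interval), scan the store and delete entries older than 24 hours.
        context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            Instant cutoff = Instant.ofEpochMilli(now).minus(Duration.ofHours(24));
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    if (Instant.parse(entry.value).isBefore(cutoff)) {
                        store.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, Message message) {
        // Keep only the latest timestamp per sender, like the reduce() above.
        store.put(message.from, message.sentAt.toString());
        return KeyValue.pair(message.from, message.sentAt.toString());
    }

    @Override
    public void close() {
        // Nothing to clean up; the store is managed by Kafka Streams.
    }
}
The transformer would be wired in with something like messages.transform(LastSeenTransformer::new, AggregatorApplication.MESSAGE_STORE), after registering the store via StreamsBuilder#addStateStore; lookups stay the same (store.get(from)).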

Related

Message batching: Apache Beam pipeline triggering immediately instead of after fixed window

I have an Apache Beam pipeline that I would like to read from a Google Pub/Sub topic, apply deduplication, and emit the messages to another Pub/Sub topic at the end of 15-minute fixed windows. I managed to get it working with the deduplication; however, the issue is that the messages seem to get sent to the sink topic immediately instead of waiting for the end of the 15 minutes.
Even after applying Window.triggering(AfterWatermark.pastEndOfWindow()) it didn't seem to work (i.e. messages are still emitted immediately). (Ref: https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html)
Can anyone help me figure out what is wrong with my code? Full code below.
Also, would it be correct to assume that the deduplication takes the fixed window as its bound, or would I need to set the time domain for deduplication separately? (Ref: https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/transforms/Deduplicate.html seems to say that it defaults to the time domain, which would be the fixed window defined.)
pipeline
    .apply("Read from source topic", PubsubIO.<String>readStrings().fromTopic(SOURCE_TOPIC))
    .apply("De-duplication", Deduplicate.<String>values())
    .apply(windowDurationMins + " min Window",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(windowDurationMins)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
    .apply("Format to JSON", ParDo.of(new DataToJson()))
    .apply("Emit to sink topic", PubsubIO.writeStrings().to(SINK_TOPIC));
[Update]
I tried the following, but nothing seemed to work:
Removed the deduplication.
Changed the trigger to .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow())).
Read from the topic with a timestamp attribute: PubsubIO.<String>readStrings().fromTopic(SOURCE_TOPIC).withTimestampAttribute("publishTime").
Windowing seems to require some kind of timestamp associated with each data element, but .withTimestampAttribute("publishTime") from PubsubIO didn't seem to work. Is there something else I could try in order to attach a timestamp to my data for windowing?
[Update 2]
Tried manually attaching a timestamp based on this ref (https://beam.apache.org/documentation/programming-guide/#adding-timestamps-to-a-pcollections-elements) as below, but it STILL doesn't work:
.apply("Add timestamp", ParDo.of(new ApplyTimestamp()))
public class ApplyTimestamp extends DoFn<String, String> {
#ProcessElement
public void addTimestamp(ProcessContext context) {
try {
String data = context.element();
Instant timestamp = new Instant();
context.outputWithTimestamp(data, timestamp);
} catch(Exception e) {
LOG.error("Error timestamping", e);
}
}
}
At this point I feel like I'm about to go crazy lol...
A GBK transform is needed between the windowing (which should come immediately after reading from the source) and the deduplication logic. Windows are applied at the next GroupByKey, including one inside a composite transform; GBK groups elements by the combination of key and window. Also note that the default trigger is already AfterWatermark with zero allowed lateness.
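For illustration, a minimal sketch of that reordering, reusing SOURCE_TOPIC, SINK_TOPIC, windowDurationMins and DataToJson from the question; the assumption (per the note above) is that the grouping inside the deduplication step then happens per key and window:
pipeline
    .apply("Read from source topic", PubsubIO.readStrings().fromTopic(SOURCE_TOPIC))
    // Window right after the read, so every downstream grouping happens per key *and* window.
    .apply(windowDurationMins + " min Window",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(windowDurationMins)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
    // Deduplication now runs inside the fixed windows instead of on the global window.
    .apply("De-duplication", Deduplicate.<String>values())
    .apply("Format to JSON", ParDo.of(new DataToJson()))
    .apply("Emit to sink topic", PubsubIO.writeStrings().to(SINK_TOPIC));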

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a Spark Streaming application that needs to take these steps:
Take a string and apply some map transformations to it
Map again: if this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the Spark environment)
collect() and save in a specific directory
Apply some other transformation/enrichment
collect() and save in another directory
As you can see, this implies two lazily activated computations, which perform the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds of lines per second this would kill my server.
I am also trying to maintain the order of operations, though this is not as important. Is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD {
    splittedRDD = arg.map { split the string };
    assRDD      = splittedRDD.map { associate to a table };
    flaggedRDD  = assRDD.map { add a boolean parameter under an if condition + send mail };
    externalClass.saveStaticMethod( flaggedRDD.collect() and save in file );
    enrichRDD   = flaggedRDD.map { enrich with external data };
    externalClass.saveStaticMethod( enrichRDD.collect() and save in file );
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
In the end, the methods I found were these:
1. In the DStream transformation before the side-effecting one, make a copy of the DStream: one copy goes on with the transformation, the other gets the .foreachRDD{ outside action } (see the sketch below). There is no major downside to this, as it is just one more RDD on a worker node.
2. Extract the { outside action } from the transformation and track the mails that have already been sent: filter out elements whose mail was already sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
3. Cache before going on (although I was trying to avoid it, there was not much else to do).
If you are trying not to cache, solution 1 is the way to go.
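A minimal sketch of solution 1 with the Java Spark Streaming API; the names assStream, addFlag(), hasFlag(), sendMail(), enrich() and ExternalClass.saveStaticMethod() are hypothetical stand-ins for the pseudocode in the question:
// Fork the DStream before the side effect: one branch only sends the mails,
// the other carries on with the saving/enrichment pipeline.
JavaDStream<String> flagged = assStream.map(record -> addFlag(record));

// Branch 1: the outside action, isolated in its own foreachRDD.
flagged.foreachRDD(rdd ->
    rdd.filter(record -> hasFlag(record))
       .foreach(record -> sendMail(record)));

// Branch 2: the rest of the pipeline, unchanged.
flagged.foreachRDD(rdd -> {
    ExternalClass.saveStaticMethod(rdd.collect());
    ExternalClass.saveStaticMethod(rdd.map(record -> enrich(record)).collect());
});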

Is it possible to create an RHQ plugin that collects historic measurements from files?

I'm trying to create an RHQ plugin to gather some measurements. It seems relatively easy to create a plugin that returns a value for the present moment. However, I need to collect these measurements from files. These files are created on a schedule, for example one per hour, but they contain much finer-grained measurements, for example a measurement for every minute. A file may look something like this:
18:00 20
18:01 42
18:02 39
...
18:58 12
18:59 15
Is it possible to create an RHQ plugin that can return many values with timestamps for a measurement?
I think that within org.rhq.core.pluginapi.measurement.MeasurementFacet#getValues you can return as many values as you want in the MeasurementReport.
So basically open the file, seek to the last known position (if the file is only ever appended to), read from there, and for each line do:
MeasurementData data = new MeasurementDataNumeric(timeInFile, request, valueFromFile);
report.addData(data);
Of course, alerting on this (historical) data is sort of questionable: if you only read the file one hour later, the alert cannot retroactively be fired at the time the bad value happened :->
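A minimal sketch of such a getValues() implementation, using the file layout from the question; the file path and the parseTimestamp() helper are hypothetical, and the three-argument MeasurementDataNumeric constructor (taking a collection time) is the one used in the snippet above:
@Override
public void getValues(MeasurementReport report, Set<MeasurementScheduleRequest> metrics) throws Exception {
    for (MeasurementScheduleRequest request : metrics) {
        // Read the hourly file and emit one datapoint per line, each with its own timestamp.
        try (BufferedReader reader = new BufferedReader(new FileReader("/path/to/hourly-file"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\\s+");        // e.g. "18:01 42"
                long timeInFile = parseTimestamp(parts[0]); // hypothetical helper: "18:01" -> epoch millis
                Double value = Double.valueOf(parts[1]);
                report.addData(new MeasurementDataNumeric(timeInFile, request, value));
            }
        }
    }
}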
Yes, it is surely possible.
@Override
public void getValues(MeasurementReport report, Set<MeasurementScheduleRequest> metrics) throws Exception {
    for (MeasurementScheduleRequest request : metrics) {
        Double result = SomeReadUtilClass.readValueFromFile();
        MeasurementData data = new MeasurementDataNumeric(request, result);
        report.addData(data);
    }
}
SomeReadUtilClass is a utility class to read the file, and readValueFromFile is the function where you can write your logic to read the value from the file.
result is the Double value that matters most; you can calculate it from a database or read it from a file, and then pass it to the MeasurementDataNumeric constructor: new MeasurementDataNumeric(request, result).

Manipulating form input values after submission causes multiple instances

I'm building a form with Yii that updates two models at once.
The form takes the inputs for each model as $modelA and $modelB and then handles them separately as described here http://www.yiiframework.com/wiki/19/how-to-use-a-single-form-to-collect-data-for-two-or-more-models/
This is all good. The difference from the example is that $modelA (documents) has to be saved and its ID retrieved, and then $modelB has to be saved including the ID from $modelA, as they are related.
There's an additional twist that $modelB has a file which needs to be saved.
My action code is as follows:
if(isset($_POST['Documents'], $_POST['DocumentVersions']))
{
    $modelA->attributes=$_POST['Documents'];
    $modelB->attributes=$_POST['DocumentVersions'];
    $valid=$modelA->validate();
    $valid=$modelB->validate() && $valid;
    if($valid)
    {
        $modelA->save(false); // don't validate as we validated above.
        $newdoc = $modelA->primaryKey; // get the ID of the document just created
        $modelB->document_id = $newdoc; // set the document_id of the DocumentVersions to be $newdoc
        // todo: set the filename to some long hash
        $modelB->file=CUploadedFile::getInstance($modelB,'file');
        // finish set filename
        $modelB->save(false);
        if($modelB->save()) {
            $modelB->file->saveAs(Yii::getPathOfAlias('webroot').'/uploads/'.$modelB->file);
        }
        $this->redirect(array('projects/myprojects','id'=>$_POST['project_id']));
    }
}
else {
    $this->render('create',array(
        'modelA'=>$modelA,
        'modelB'=>$modelB,
        'parent'=>$id,
        'userid'=>$userid,
        'categories'=>$categoriesList
    ));
}
You can see that I push the new values for 'file' and 'document_id' into $modelB. This all works no problem, but... each time I push one of these values into $modelB I seem to get a new instance of $modelA. The net result: I get 3 new documents and 1 new version. The new version is all linked up correctly, but the other two documents are just straight duplicates.
I've tested removing the $modelB update steps, and sure enough, for each one removed a copy of $modelA is removed (or at least the resulting database entry).
I've no idea how to prevent this.
UPDATE....
As I put in a comment below, further testing shows the number of instances of $modelA depends on how many times the form has been submitted. Even if other pages/views are accessed in the meantime, if the form is resubmitted within a short period of time, each time I get an extra entry in the database. If this was due to some form of persistence, then I'd expect to get an extra copy of the PREVIOUS model, not multiples of the current one. So I suspect something in the way its saving, like there is some counter that's incrementing, but I've no idea where to look for this, or how to zero it each time.
Some help would be much appreciated.
thanks
JMB
OK, I had Ajax validation set to true. This was calling the create action and inserting entries. I don't fully get this, or how I could use Ajax validation without this effect if I really wanted to, but... at least the two-model insert with the relationship works.
Thanks for the comments.
cheers
JMB

Does setting a memcached key that already exists refresh the expiration time?

Let's say I have the following code:
Memcached->set('key', 'value', 60); // expire in one minute
while (1) {
    sleep 1 second;
    data = Memcached->get('key');
    // update data
    Memcached->set('key', data, 60);
}
After 60 iterations of the loop, will the key expire so that reading it returns NULL? Or will the continuous setting keep pushing the expiration each time to one minute after the last set?
The documentation mentions this, and I've tested this in different contexts and I'm pretty sure I got different results.
OK, found my answer by experimentation in the end...
It turns out "set" does extend the expiration; it's basically the same as deleting the item and setting it again with a new expiration.
However, Increment doesn't extend the expiration. If you increment a key, it keeps the original expiration time it had when you first set it.
If you simply want to extend the expiration time for a particular key, instead of essentially resetting the data each time, you can just use Memcached::touch,
with the caveat that you must have the binary protocol enabled, according to the comment on the above page.
$memcached = new Memcached();
$memcached->setOption(Memcached::OPT_BINARY_PROTOCOL, true);
$memcached->touch('key', 120);
The set doesn't care whatsoever about what may have been there, and can't assume that it even came from the same application.
What all did you test and what kinds of results did you get? Memcached never guarantees to return a value, so if you saw it go missing, it's quite possible to construct a test that would lose that regardless of expiration.
The best documentation source is the memcached protocol description:
First, the client sends a command line which looks like this:
<command name> <key> <flags> <exptime> <bytes> [noreply]\r\n
- <command name> is "set", "add", "replace", "append" or "prepend"
As you can see, each of the commands above has the exptime field, which is mandatory.
So, yes, it will update the expiration time. Moreover, memcached creates a new item with its own key/flags/expiration/value and replaces the old one with it.
If your goal is simply to extend the expiration time, use the touch command, which was created to set a new expiration time for a key.
See https://manned.org/memctouch or http://docs.libmemcached.org/bin/memtouch.html
Debian package: libmemcached-tools
From the shell: man memctouch
Other distros use "memtouch" as the name of the command-line tool.
+1 for the link to the memcached protocol, as a manual reference:
https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L318
Example:
memctouch --servers=localhost --expire=60 $key
Where $key is your 'key'. This will set the expire time to 60 seconds, as in your example, but without having to do a "get" AND re-set the key. (What if your 'key' is not set yet and your 'get' doesn't return any data?)