I have a Flink application on the AWS Kinesis Analytics service. I need to filter some values on a data stream based on a threshold, and I'm passing the threshold parameter in via the AWS Systems Manager Parameter Store service. For now, I have this:
In my Main class:
val threshold: Int = ssmParameter.getParameterRequest(ssmClient, "/kinesis/threshold").toInt

val kinesis_deserialization_schema = new KinesisDeserialization[ID]

val KinesisConsumer = new FlinkKinesisConsumer[ID](
  "Data-Stream",
  kinesis_deserialization_schema,
  consumerProps
)

val KinesisSource = env.addSource(KinesisConsumer).name(s"Kinesis Data")

val valid_data = KinesisSource
  .filter(new MyFilter[ID](threshold))
  .name("FilterData")
  .uid("FilterData")
Filter class:
import cl.mydata.InputData
import org.apache.flink.api.common.functions.FilterFunction
class MyFilter[ID <: InputData](
  threshold: Int
) extends FilterFunction[ID] {
  override def filter(value: ID): Boolean = {
    value.myvalue > threshold
  }
}
This works fine. The thing is that I need to update the threshold parameter every hour, because that value can be changed by my client.
Perhaps you can have the MyFilter class implement the ProcessingTimeCallback interface, which supports timer operations, and update the threshold in the onProcessingTime callback. A rough sketch:
public class MyFilter extends RichFilterFunction<...> implements ProcessingTimeCallback {

    private int threshold;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Register the first timer one hour from now.
        // (Alternatively, drive the updates with your own executor:
        //  scheduler.scheduleAtFixedRate(this, 1, 1, TimeUnit.HOURS);)
        final long now = getProcessingTimeService().getCurrentProcessingTime();
        getProcessingTimeService().registerTimer(now + 3600000, this);
    }

    @Override
    public boolean filter(IN xxx) throws Exception {
        return xxx > threshold;
    }

    @Override
    public void onProcessingTime(long timestamp) throws Exception {
        threshold = XXXX; // re-read the parameter here
        // Schedule the next update in another hour.
        final long now = getProcessingTimeService().getCurrentProcessingTime();
        getProcessingTimeService().registerTimer(now + 3600000, this);
    }
}
You could turn the FilterFunction into a BroadcastProcessFunction, and broadcast new thresholds as they become available.
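For example, a rough Java sketch of the broadcast approach (the question's job is in Scala; InputData's getter, the state descriptor name, and the hourly Parameter Store polling source are assumptions, not the poster's actual code):

public class ThresholdFilter extends BroadcastProcessFunction<InputData, Integer, InputData> {

    private final MapStateDescriptor<String, Integer> descriptor;
    private final int defaultThreshold;

    public ThresholdFilter(MapStateDescriptor<String, Integer> descriptor, int defaultThreshold) {
        this.descriptor = descriptor;
        this.defaultThreshold = defaultThreshold;
    }

    @Override
    public void processElement(InputData value, ReadOnlyContext ctx, Collector<InputData> out) throws Exception {
        // Read the most recently broadcast threshold, falling back to the initial one.
        Integer current = ctx.getBroadcastState(descriptor).get("threshold");
        int threshold = (current != null) ? current : defaultThreshold;
        if (value.getMyValue() > threshold) {
            out.collect(value);
        }
    }

    @Override
    public void processBroadcastElement(Integer newThreshold, Context ctx, Collector<InputData> out) throws Exception {
        // Every parallel instance receives the update and stores it in broadcast state.
        ctx.getBroadcastState(descriptor).put("threshold", newThreshold);
    }
}

Wiring it up would look roughly like this, where thresholdUpdates is some stream that re-reads the Parameter Store value every hour:

MapStateDescriptor<String, Integer> descriptor =
    new MapStateDescriptor<>("threshold", Types.STRING, Types.INT);
BroadcastStream<Integer> thresholds = thresholdUpdates.broadcast(descriptor);

DataStream<InputData> validData = kinesisSource
    .connect(thresholds)
    .process(new ThresholdFilter(descriptor, initialThreshold));

The main advantage over the timer approach is that a threshold change reaches every parallel instance through broadcast state, without each subtask having to call Parameter Store itself.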
Environment: I am running Apache Ignite v2.13.0 for the cache and the cache store is persisting to a Mongo DB v3.6.0. I am also utilizing Spring Boot (Java).
Question: When I have an expiration policy set, how do I remove the corresponding data from my persistent database?
What I have attempted: I have attempted to utilize CacheEntryExpiredListener but my print statement is not getting triggered. Is this the proper way to solve the problem?
Here is a sample bit of code:
@Service
public class CacheRemovalListener implements CacheEntryExpiredListener<Long, Metrics> {
    @Override
    public void onExpired(Iterable<CacheEntryEvent<? extends Long, ? extends Metrics>> events) throws CacheEntryListenerException {
        for (CacheEntryEvent<? extends Long, ? extends Metrics> event : events) {
            System.out.println("Received a " + event);
        }
    }
}
Use Continuous Query to get notifications about Ignite data changes.
ExecutorService mongoUpdateExecutor = Executors.newSingleThreadExecutor();

CacheEntryUpdatedListener<Integer, Integer> lsnr = new CacheEntryUpdatedListener<Integer, Integer>() {
    @Override
    public void onUpdated(Iterable<CacheEntryEvent<? extends Integer, ? extends Integer>> evts) {
        for (CacheEntryEvent<?, ?> e : evts) {
            if (e.getEventType() == EventType.EXPIRED) {
                // Use separate executor to avoid blocking Ignite threads
                mongoUpdateExecutor.submit(() -> removeFromMongo(e.getKey()));
            }
        }
    }
};

var qry = new ContinuousQuery<Integer, Integer>()
    .setLocalListener(lsnr)
    .setIncludeExpired(true);

// Start receiving updates.
var cursor = cache.query(qry);

// Stop receiving updates.
cursor.close();
Note 1: EXPIRED events should be enabled explicitly with ContinuousQuery#setIncludeExpired.
Note 2: Query listeners should not perform any heavy/blocking operations. Offload that work to a separate thread/executor.
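For the Mongo side, removeFromMongo can be a plain Spring Data delete. A minimal sketch, assuming a MongoTemplate bean and that the expired cache key is stored as the document's _id (the class and method names are assumptions):

@Service
public class MongoExpirationCleaner {

    private final MongoTemplate mongoTemplate;

    public MongoExpirationCleaner(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Delete the persisted Metrics document that corresponds to the expired cache key.
    public void removeFromMongo(Object key) {
        mongoTemplate.remove(Query.query(Criteria.where("_id").is(key)), Metrics.class);
    }
}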
I implemented a connection to a Kafka stream as described here. Now I am attempting to write the data into a Postgres database using a JDBC sink.
The Kafka source seems to have no type, so when writing the SQL statements everything looks like type Nothing.
How can I use fromSource so that I actually have a typed source for Kafka?
What I have tried so far is the following:
object Main {
  def main(args: Array[String]) {
    val builder = KafkaSource.builder
    builder.setBootstrapServers("localhost:29092")
    builder.setProperty("partition.discovery.interval.ms", "10000")
    builder.setTopics("created")
    builder.setBounded(OffsetsInitializer.latest)
    builder.setStartingOffsets(OffsetsInitializer.earliest)
    builder.setDeserializer(KafkaRecordDeserializationSchema.of(new CreatedEventSchema))
    val source = builder.build()

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val streamSource = env
      .fromSource(source, WatermarkStrategy.noWatermarks, "Kafka Source")

    streamSource.addSink(JdbcSink.sink(
      "INSERT INTO conversations (timestamp, active_conversations, total_conversations) VALUES (?,?,?)",
      (statement, event) => {
        statement.setTime(1, event.date)
        statement.setInt(2, event.a)
        statement.setInt(3, event.b)
      },
      JdbcExecutionOptions.builder()
        .withBatchSize(1000)
        .withBatchIntervalMs(200)
        .withMaxRetries(5)
        .build(),
      new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
        .withUrl("jdbc:postgresql://localhost:5432/reporting")
        .withDriverName("org.postgresql.Driver")
        .withUsername("postgres")
        .withPassword("veryverysecret:-)")
        .build()
    ))

    env.execute()
  }
}
This does not compile because event is of type Nothing. But I don't think it should be, because with CreatedEventSchema Flink should be able to deserialise the records.
It may also be important to note that I actually just want to process the values of the Kafka messages.
In Java you might do something like this:
KafkaSource<Event> source =
    KafkaSource.<Event>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics(TOPIC)
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new EventDeserializationSchema())
        .build();
with a value deserializer along these lines:
public class EventDeserializationSchema extends AbstractDeserializationSchema<Event> {

    private static final long serialVersionUID = 1L;

    private transient ObjectMapper objectMapper;

    @Override
    public void open(InitializationContext context) {
        objectMapper = JsonMapper.builder().build().registerModule(new JavaTimeModule());
    }

    @Override
    public Event deserialize(byte[] message) throws IOException {
        return objectMapper.readValue(message, Event.class);
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeInformation.of(Event.class);
    }
}
Sorry I don't have a Scala example handy, but hopefully this will point you in the right direction.
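And on the sink side, once the source is typed, the JDBC statement builder compiles against a concrete Event instead of Nothing. A rough sketch (the Event getters are assumptions; the execution and connection options are the same builders as in the question):

DataStream<Event> events =
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

events.addSink(
    JdbcSink.sink(
        "INSERT INTO conversations (timestamp, active_conversations, total_conversations) VALUES (?,?,?)",
        (statement, event) -> {
            statement.setTimestamp(1, event.getTimestamp());      // assumed java.sql.Timestamp getter
            statement.setInt(2, event.getActiveConversations());  // assumed int getter
            statement.setInt(3, event.getTotalConversations());   // assumed int getter
        },
        executionOptions,     // JdbcExecutionOptions built as in the question
        connectionOptions));  // JdbcConnectionOptions built as in the question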
I have a stream processing application built with Spring Cloud Stream and Kafka Streams. The system takes logs from an application and compares them to observations made by another stream processor, producing a score; the log stream is then split by the score (above and below some threshold).
The topology: (diagram not included here)
The issue:
So my problem is how to properly implement the "Log best observation selector processor". There is a finite number of observations at the moment the log is processed, but there may be a lot of them.
So I came up with two solutions:
1. Group and window the log-scored-observations topic by log id, then reduce to get the highest score (a rough sketch of this appears after the list). Problem: scoring all observations may take longer than the window.
2. Emit a scoring-completed message after every scoring, join it with log-relevant-observations, use the log-scored-observations global table and an interactive query to check that every observation id is in the global table store, and when all ids are in the store, map to the observation with the highest score. Problem: the global table does not appear to work when only used for interactive queries.
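For reference, solution 1 would look roughly like this in the Streams DSL (a sketch only; the stream name, serdes, and window size are placeholders, and getLogId()/getScore() mirror the classes used in the answer below):

KStream<String, LogEntryObservationMatchTuple> bestPerLog = logScoredObservations
    .selectKey((k, v) -> k.getLogId())
    .groupByKey(Grouped.with(Serdes.String(), logEntryObservationMatchTupleSerde))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .reduce((m1, m2) -> m1.getScore() >= m2.getScore() ? m1 : m2)
    .toStream()
    .map((windowedKey, match) -> KeyValue.pair(windowedKey.key(), match));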
What would be the best way to achieve what I'm trying?
I'm hoping not to create any partition, disk or memory bottleneck.
Everything has unique ids and tuples of relevant ids when the value is joined from log & observation.
(Edit: Switched text description of topology with a diagram & change title)
Solution #2 seems to work, but it emitted warnings because interactive queries take some time to be ready, so I implemented the same solution with a Transformer:
@Slf4j
@Configuration
@RequiredArgsConstructor
@SuppressWarnings("unchecked")
public class LogBestObservationsSelectorProcessorConfig {
private String logScoredObservationsStore = "log-scored-observations-store";
private final Serde<LogEntryRelevantObservationIdTuple> logEntryRelevantObservationIdTupleSerde;
private final Serde<LogRelevantObservationIdsTuple> logRelevantObservationIdsTupleSerde;
private final Serde<LogEntryObservationMatchTuple> logEntryObservationMatchTupleSerde;
private final Serde<LogEntryObservationMatchIdsRelevantObservationsTuple> logEntryObservationMatchIdsRelevantObservationsTupleSerde;
@Bean
public Function<
GlobalKTable<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple>,
Function<
KStream<LogEntryRelevantObservationIdTuple, LogEntryRelevantObservationIdTuple>,
Function<
KTable<String, LogRelevantObservationIdsTuple>,
KStream<String, LogEntryObservationMatchTuple>
>
>
>
logBestObservationSelectorProcessor() {
return (GlobalKTable<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple> logScoredObservationsTable) ->
(KStream<LogEntryRelevantObservationIdTuple, LogEntryRelevantObservationIdTuple> logScoredObservationProcessedStream) ->
(KTable<String, LogRelevantObservationIdsTuple> logRelevantObservationIdsTable) -> {
return logScoredObservationProcessedStream
.selectKey((k, v) -> k.getLogId())
.leftJoin(
logRelevantObservationIdsTable,
LogEntryObservationMatchIdsRelevantObservationsTuple::new,
Joined.with(
Serdes.String(),
logEntryRelevantObservationIdTupleSerde,
logRelevantObservationIdsTupleSerde
)
)
.transform(() -> new LogEntryObservationMatchTransformer(logScoredObservationsStore))
.groupByKey(
Grouped.with(
Serdes.String(),
logEntryObservationMatchTupleSerde
)
)
.reduce(
(match1, match2) -> Double.compare(match1.getScore(), match2.getScore()) != -1 ? match1 : match2,
Materialized.with(
Serdes.String(),
logEntryObservationMatchTupleSerde
)
)
.toStream()
;
};
}
@RequiredArgsConstructor
private static class LogEntryObservationMatchTransformer implements Transformer<String, LogEntryObservationMatchIdsRelevantObservationsTuple, KeyValue<String, LogEntryObservationMatchTuple>> {
private final String stateStoreName;
private ProcessorContext context;
private TimestampedKeyValueStore<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple> kvStore;
@Override
public void init(ProcessorContext context) {
this.context = context;
this.kvStore = (TimestampedKeyValueStore<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple>) context.getStateStore(stateStoreName);
}
@Override
public KeyValue<String, LogEntryObservationMatchTuple> transform(String logId, LogEntryObservationMatchIdsRelevantObservationsTuple value) {
val observationIds = value.getLogEntryRelevantObservationsTuple().getRelevantObservations().getObservationIds();
val allObservationsProcessed = observationIds.stream()
.allMatch((observationId) -> {
val key = LogEntryRelevantObservationIdTuple.newBuilder()
.setLogId(logId)
.setRelevantObservationId(observationId)
.build();
return kvStore.get(key) != null;
});
if (!allObservationsProcessed) {
return null;
}
val observationId = value.getLogEntryRelevantObservationIdTuple().getObservationId();
val key = LogEntryRelevantObservationIdTuple.newBuilder()
.setLogId(logId)
.setRelevantObservationId(observationId)
.build();
ValueAndTimestamp<LogEntryObservationMatchTuple> observationMatchValueAndTimestamp = kvStore.get(key);
return new KeyValue<>(logId, observationMatchValueAndTimestamp.value());
}
@Override
public void close() {
}
}
}
I'm trying to build an event processing pipeline using Apache Beam.
Steps which happen in my pipeline:
1. Read from Kafka topics in Avro format and deserialize Avro using a schema registry
2. Create a fixed-size window (1 hour) with triggering every 10 min (processing time)
3. Write Avro files to GCS, dividing directories by topic name (filename = schema + start-end window pane)
Now let's dig into the code.
This code shows how I read from Kafka. I use a custom deserializer and coder to deserialize properly using a schema registry (in my case it's Hortonworks).
KafkaIO.<String, AvroGenericRecord>read()
    .withBootstrapServers(bootstrapServers)
    .withConsumerConfigUpdates(configUpdates)
    .withTopics(inputTopics)
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializerAndCoder(BeamKafkaAvroGenericDeserializer.class, AvroGenericCoder.of(serDeConfig()))
    .commitOffsetsInFinalize()
    .withoutMetadata();
In the pipeline, after reading records with KafkaIO, I create the windowing.
records.apply(Window.<AvroGenericRecord>into(FixedWindows.of(Duration.standardHours(1)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(10)))
        .withLateFirings(AfterPane.elementCountAtLeast(1))
    )
    .withAllowedLateness(Duration.standardMinutes(5))
    .discardingFiredPanes()
)
What I want to achieve by this window is to group data by event time every 1 hour and trigger every 10 min.
After grouping by a window it starts writing into Google Cloud Storage (GCS).
public class WriteAvroFilesTr extends PTransform<PCollection<AvroGenericRecord>, WriteFilesResult<AvroDestination>> {
private String baseDir;
private int numberOfShards;
public WriteAvroFilesTr(String baseDir, int numberOfShards) {
this.baseDir = baseDir;
this.numberOfShards = numberOfShards;
}
@Override
public WriteFilesResult<AvroDestination> expand(PCollection<AvroGenericRecord> input) {
ResourceId tempDir = getTempDir(baseDir);
return input.apply(AvroIO.<AvroGenericRecord>writeCustomTypeToGenericRecords()
.withTempDirectory(tempDir)
.withWindowedWrites()
.withNumShards(numberOfShards)
.to(new DynamicAvroGenericRecordDestinations(baseDir, Constants.FILE_EXTENSION))
);
}
private ResourceId getTempDir(String baseDir) {
return FileSystems.matchNewResource(baseDir + "/temp", true);
}
}
And
public class DynamicAvroGenericRecordDestinations extends DynamicAvroDestinations<AvroGenericRecord, AvroDestination, GenericRecord> {
private static final DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");
private final String baseDir;
private final String fileExtension;
public DynamicAvroGenericRecordDestinations(String baseDir, String fileExtension) {
this.baseDir = baseDir;
this.fileExtension = fileExtension;
}
@Override
public Schema getSchema(AvroDestination destination) {
return new Schema.Parser().parse(destination.jsonSchema);
}
@Override
public GenericRecord formatRecord(AvroGenericRecord record) {
return record.getRecord();
}
@Override
public AvroDestination getDestination(AvroGenericRecord record) {
Schema schema = record.getRecord().getSchema();
return AvroDestination.of(record.getName(), record.getDate(), record.getVersionId(), schema.toString());
}
@Override
public AvroDestination getDefaultDestination() {
return new AvroDestination();
}
@Override
public FileBasedSink.FilenamePolicy getFilenamePolicy(AvroDestination destination) {
String pathStr = baseDir + "/" + destination.name + "/" + destination.date + "/" + destination.name;
return new WindowedFilenamePolicy(FileBasedSink.convertToFileResourceIfPossible(pathStr), destination.version, fileExtension);
}
private static class WindowedFilenamePolicy extends FileBasedSink.FilenamePolicy {
final ResourceId outputFilePrefix;
final String fileExtension;
final Integer version;
WindowedFilenamePolicy(ResourceId outputFilePrefix, Integer version, String fileExtension) {
this.outputFilePrefix = outputFilePrefix;
this.version = version;
this.fileExtension = fileExtension;
}
@Override
public ResourceId windowedFilename(
int shardNumber,
int numShards,
BoundedWindow window,
PaneInfo paneInfo,
FileBasedSink.OutputFileHints outputFileHints) {
IntervalWindow intervalWindow = (IntervalWindow) window;
String filenamePrefix =
outputFilePrefix.isDirectory() ? "" : firstNonNull(outputFilePrefix.getFilename(), "");
String filename =
String.format("%s-%s(%s-%s)-(%s-of-%s)%s", filenamePrefix,
version,
formatter.print(intervalWindow.start()),
formatter.print(intervalWindow.end()),
shardNumber,
numShards - 1,
fileExtension);
ResourceId result = outputFilePrefix.getCurrentDirectory();
return result.resolve(filename, RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(
int shardNumber, int numShards, FileBasedSink.OutputFileHints outputFileHints) {
throw new UnsupportedOperationException("Expecting windowed outputs only");
}
@Override
public void populateDisplayData(DisplayData.Builder builder) {
builder.add(
DisplayData.item("fileNamePrefix", outputFilePrefix.toString())
.withLabel("File Name Prefix"));
}
}
}
I've written out the whole of my pipeline. It kind of works well, but I am not sure whether I actually handle events by event time.
Could someone review my code (especially steps 1 & 2, where I read and group by windows) and tell me whether it windows by event time or not?
P.S. Every record in Kafka has a timestamp field inside.
UPD
Thanks jjayadeep. I included a custom TimestampPolicy in KafkaIO:
static class CustomTimestampPolicy extends TimestampPolicy<String, AvroGenericRecord> {

    protected Instant currentWatermark;

    CustomTimestampPolicy(Optional<Instant> previousWatermark) {
        this.currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
    }

    @Override
    public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, AvroGenericRecord> record) {
        currentWatermark = Instant.ofEpochMilli(record.getKV().getValue().getTimestamp());
        return currentWatermark;
    }

    @Override
    public Instant getWatermark(PartitionContext ctx) {
        return currentWatermark;
    }
}
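which is plugged into the reader via withTimestampPolicyFactory, roughly like this (a sketch based on the reader configuration from above):

KafkaIO.<String, AvroGenericRecord>read()
    .withBootstrapServers(bootstrapServers)
    .withConsumerConfigUpdates(configUpdates)
    .withTopics(inputTopics)
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializerAndCoder(BeamKafkaAvroGenericDeserializer.class, AvroGenericCoder.of(serDeConfig()))
    // Use the record's own timestamp field (event time) instead of the default processing time.
    .withTimestampPolicyFactory(
        (tp, previousWatermark) -> new CustomTimestampPolicy(previousWatermark))
    .commitOffsetsInFinalize()
    .withoutMetadata();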
From the documentation present here [1], processing time is used as the record timestamp (event time) by default in KafkaIO:
By default, record timestamp (event time) is set to processing time in KafkaIO reader and source watermark is current wall time. If a topic has Kafka server-side ingestion timestamp enabled ('LogAppendTime'), it can enabled with KafkaIO.Read.withLogAppendTime(). A custom timestamp policy can be provided by implementing TimestampPolicyFactory. See KafkaIO.Read.withTimestampPolicyFactory(TimestampPolicyFactory) for more information.
Also, processing time is the default timestamp method used, as documented below:
// set event times and watermark based on LogAppendTime. To provide a custom
// policy see withTimestampPolicyFactory(). withProcessingTime() is the default.
1 - https://beam.apache.org/releases/javadoc/2.4.0/org/apache/beam/sdk/io/kafka/KafkaIO.html
I am trying to make a system resource monitor using Rx. I use a thread for the observable, which returns the CPU usage every 1000 milliseconds. Now I want my subscriber to find the average of the CPU usage every 10 seconds.
public class seperate {

    private ScheduledThreadPoolExecutor executorPool;

    public void test() {
        Observable<Double> myObservable = Observable.create(
            new Observable.OnSubscribe<Double>() {
                @Override
                public void call(Subscriber<? super Double> sub) {
                    executorPool = new ScheduledThreadPoolExecutor(9);
                    int timeout1 = 10;
                    TimerTask timeoutTimertask1 = new MyTimerTasks(sub);
                    executorPool.scheduleAtFixedRate(timeoutTimertask1, timeout1, timeout1,
                        TimeUnit.MILLISECONDS);
                    // This returns the cpu usage every 10ms.
                }
            }
        );

        Subscriber<Double> mySubscriber = new Subscriber<Double>() {
            @Override
            public void onNext(Double s) { System.out.println(s); }

            @Override
            public void onCompleted() { }

            @Override
            public void onError(Throwable e) { }
        };

        myObservable.subscribe(mySubscriber);
    }
}
You can use buffer or window to divide the source emission into groups of items, then calculate the average of each group (a buffer-based variant is sketched after the window example below).
Average is part of the rxjava-math library.
Moreover, you can simplify your code using interval.
Below is an example using window and interval:
Observable<Double> myObservable = Observable.interval(10, TimeUnit.MILLISECONDS)
    .map(new Func1<Long, Double>() {
        @Override
        public Double call(Long aLong) {
            Double d = 100.; // calculate CPU usage
            return d;
        }
    })
    .window(10, TimeUnit.SECONDS)
    .flatMap(new Func1<Observable<Double>, Observable<Double>>() {
        @Override
        public Observable<Double> call(Observable<Double> windowObservable) {
            return MathObservable.averageDouble(windowObservable);
        }
    });
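If you prefer buffer over window, a roughly equivalent sketch (buffer emits the collected samples as a List; readCpuUsage() is a placeholder for the real sampling call, and this averages in plain Java instead of rxjava-math):

Observable<Double> averages = Observable.interval(1000, TimeUnit.MILLISECONDS)
    .map(tick -> readCpuUsage())   // sample the CPU once per second
    .buffer(10, TimeUnit.SECONDS)  // collect ~10 seconds of samples into a List<Double>
    .map(samples -> samples.stream()
        .mapToDouble(Double::doubleValue)
        .average()
        .orElse(0.0));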
Use MathObservable like so:
MathObservable.averageLong(longsObservable).subscribe(average -> Timber.d("average:%d", average));
More info can be found here:
https://github.com/ReactiveX/RxJava/wiki/Mathematical-and-Aggregate-Operators
Source Code:
https://github.com/ReactiveX/RxJavaMath