I'm trying to build an event-processing stream using Apache Beam.
The steps in my stream are:
Read from Kafka topics in Avro format and deserialize the Avro records using a schema registry
Create a fixed-size window (1 hour) that triggers every 10 minutes (processing time)
Write Avro files to GCS, with one directory per topic name (filename = schema + start-end-window-pane)
Now let's dive into the code.
This code shows how I read from Kafka. I use a custom deserializer and coder to deserialize properly using a schema registry (in my case, Hortonworks).
KafkaIO.<String, AvroGenericRecord>read()
.withBootstrapServers(bootstrapServers)
.withConsumerConfigUpdates(configUpdates)
.withTopics(inputTopics)
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializerAndCoder(BeamKafkaAvroGenericDeserializer.class, AvroGenericCoder.of(serDeConfig()))
.commitOffsetsInFinalize()
.withoutMetadata();
After KafkaIO reads the records, the pipeline applies windowing.
records.apply(Window.<AvroGenericRecord>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(10)))
.withLateFirings(AfterPane.elementCountAtLeast(1))
)
.withAllowedLateness(Duration.standardMinutes(5))
.discardingFiredPanes()
)
What I want to achieve with this window is to group data by event time into 1-hour windows and to fire a pane every 10 minutes.
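To make the grouping concrete, here is a minimal sketch in plain Java (not Beam API) of how a 1-hour fixed window can be derived from an element's timestamp. The class and method names are hypothetical, and the sketch assumes windows aligned to the epoch with no offset. The point is that the window depends only on the timestamp attached to the element, so whether this is event-time windowing depends on which timestamp the source assigns.

```java
import java.time.Duration;
import java.time.Instant;

public class FixedWindowSketch {

    /** Returns the start of the fixed window that contains the given timestamp. */
    public static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMs = size.toMillis();
        long ts = timestamp.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch too.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    public static void main(String[] args) {
        Duration oneHour = Duration.ofHours(1);
        Instant event = Instant.parse("2020-01-01T10:42:17Z");
        Instant start = windowStart(event, oneHour);
        // The element belongs to the half-open interval [start, start + 1 hour).
        System.out.println(start + " .. " + start.plus(oneHour));
    }
}
```

If the timestamp passed in is the arrival time rather than the record's own timestamp field, the same arithmetic silently produces processing-time windows.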
After grouping by a window it starts writing into Google Cloud Storage (GCS).
public class WriteAvroFilesTr extends PTransform<PCollection<AvroGenericRecord>, WriteFilesResult<AvroDestination>> {
private String baseDir;
private int numberOfShards;
public WriteAvroFilesTr(String baseDir, int numberOfShards) {
this.baseDir = baseDir;
this.numberOfShards = numberOfShards;
}
@Override
public WriteFilesResult<AvroDestination> expand(PCollection<AvroGenericRecord> input) {
ResourceId tempDir = getTempDir(baseDir);
return input.apply(AvroIO.<AvroGenericRecord>writeCustomTypeToGenericRecords()
.withTempDirectory(tempDir)
.withWindowedWrites()
.withNumShards(numberOfShards)
.to(new DynamicAvroGenericRecordDestinations(baseDir, Constants.FILE_EXTENSION))
);
}
private ResourceId getTempDir(String baseDir) {
return FileSystems.matchNewResource(baseDir + "/temp", true);
}
}
And
public class DynamicAvroGenericRecordDestinations extends DynamicAvroDestinations<AvroGenericRecord, AvroDestination, GenericRecord> {
private static final DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");
private final String baseDir;
private final String fileExtension;
public DynamicAvroGenericRecordDestinations(String baseDir, String fileExtension) {
this.baseDir = baseDir;
this.fileExtension = fileExtension;
}
@Override
public Schema getSchema(AvroDestination destination) {
return new Schema.Parser().parse(destination.jsonSchema);
}
@Override
public GenericRecord formatRecord(AvroGenericRecord record) {
return record.getRecord();
}
@Override
public AvroDestination getDestination(AvroGenericRecord record) {
Schema schema = record.getRecord().getSchema();
return AvroDestination.of(record.getName(), record.getDate(), record.getVersionId(), schema.toString());
}
@Override
public AvroDestination getDefaultDestination() {
return new AvroDestination();
}
@Override
public FileBasedSink.FilenamePolicy getFilenamePolicy(AvroDestination destination) {
String pathStr = baseDir + "/" + destination.name + "/" + destination.date + "/" + destination.name;
return new WindowedFilenamePolicy(FileBasedSink.convertToFileResourceIfPossible(pathStr), destination.version, fileExtension);
}
private static class WindowedFilenamePolicy extends FileBasedSink.FilenamePolicy {
final ResourceId outputFilePrefix;
final String fileExtension;
final Integer version;
WindowedFilenamePolicy(ResourceId outputFilePrefix, Integer version, String fileExtension) {
this.outputFilePrefix = outputFilePrefix;
this.version = version;
this.fileExtension = fileExtension;
}
@Override
public ResourceId windowedFilename(
int shardNumber,
int numShards,
BoundedWindow window,
PaneInfo paneInfo,
FileBasedSink.OutputFileHints outputFileHints) {
IntervalWindow intervalWindow = (IntervalWindow) window;
String filenamePrefix =
outputFilePrefix.isDirectory() ? "" : firstNonNull(outputFilePrefix.getFilename(), "");
String filename =
String.format("%s-%s(%s-%s)-(%s-of-%s)%s", filenamePrefix,
version,
formatter.print(intervalWindow.start()),
formatter.print(intervalWindow.end()),
shardNumber,
numShards - 1,
fileExtension);
ResourceId result = outputFilePrefix.getCurrentDirectory();
return result.resolve(filename, RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(
int shardNumber, int numShards, FileBasedSink.OutputFileHints outputFileHints) {
throw new UnsupportedOperationException("Expecting windowed outputs only");
}
@Override
public void populateDisplayData(DisplayData.Builder builder) {
builder.add(
DisplayData.item("fileNamePrefix", outputFilePrefix.toString())
.withLabel("File Name Prefix"));
}
}
}
That is my whole pipeline. It mostly works, but I'm not sure whether I am actually handling events by event time.
Could someone review my code (especially steps 1 and 2, where I read and window the records) and tell me whether it windows by event time or not?
P.S. Every record in Kafka has a timestamp field inside.
UPD
Thanks, jjayadeep.
I added a custom TimestampPolicy to KafkaIO:
static class CustomTimestampPolicy extends TimestampPolicy<String, AvroGenericRecord> {
protected Instant currentWatermark;
CustomTimestampPolicy(Optional<Instant> previousWatermark) {
this.currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
}
@Override
public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, AvroGenericRecord> record) {
currentWatermark = Instant.ofEpochMilli(record.getKV().getValue().getTimestamp());
return currentWatermark;
}
@Override
public Instant getWatermark(PartitionContext ctx) {
return currentWatermark;
}
}
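One thing to watch in a policy like this: it sets the watermark to the timestamp of the most recent record, so if records within a partition are not in timestamp order, the watermark moves backwards. The following is a minimal sketch in plain Java (no Beam types; the class name is hypothetical) of tracking the maximum timestamp seen instead. It assumes that out-of-order records are possible and that holding the watermark at the maximum is acceptable for the pipeline.

```java
public class MonotonicWatermarkSketch {

    // Highest event timestamp observed so far, in epoch milliseconds.
    private long watermarkMillis = Long.MIN_VALUE;

    /** Records an event timestamp and returns it unchanged as the element timestamp. */
    public long onRecord(long eventTimeMillis) {
        watermarkMillis = Math.max(watermarkMillis, eventTimeMillis);
        return eventTimeMillis;
    }

    /** Never decreases, even when records arrive out of order. */
    public long currentWatermark() {
        return watermarkMillis;
    }

    public static void main(String[] args) {
        MonotonicWatermarkSketch wm = new MonotonicWatermarkSketch();
        wm.onRecord(1_000L);
        wm.onRecord(3_000L);
        wm.onRecord(2_000L); // out of order: the watermark stays at 3000
        System.out.println(wm.currentWatermark());
    }
}
```

Records older than the watermark are then treated as late, which is where the allowed lateness configured on the window comes into play.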
According to the documentation [1], KafkaIO uses processing time as the event time by default:
By default, record timestamp (event time) is set to processing time in KafkaIO reader and source watermark is current wall time. If a topic has Kafka server-side ingestion timestamp enabled ('LogAppendTime'), it can enabled with KafkaIO.Read.withLogAppendTime(). A custom timestamp policy can be provided by implementing TimestampPolicyFactory. See KafkaIO.Read.withTimestampPolicyFactory(TimestampPolicyFactory) for more information.
The same documentation also notes that processing time is the default timestamp policy:
// set event times and watermark based on LogAppendTime. To provide a custom
// policy see withTimestampPolicyFactory(). withProcessingTime() is the default.
[1] https://beam.apache.org/releases/javadoc/2.4.0/org/apache/beam/sdk/io/kafka/KafkaIO.html
I am generating agents in AnyLogic with parameter values that come from a SQL table. When an agent is generated at the source, I look up its row in the table (like a VLOOKUP) and extract the corresponding values. This works correctly, but it slows the model down.
The table structure looks like this:
I query the data from this table with the code below:
double value_1 = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.avg_value)).get(0);
double value_min = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.min_value)).get(0);
double value_max = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.max_value)).get(0);
// Fetch the cluster number from account table
int cluster_num = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.cluster)).get(0);
int act_no = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.actno)).get(0);
String pay_term = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.pay_term)).get(0);
String pay_term_prob = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.pay_term_prob)).get(0);
But this is very slow, and I want to improve the performance. Someone mentioned that I can create a Java class and then load the table into a collection. Is there an example I can refer to? I am finding it difficult to put the entire code together.
I have created a class using below code:
public class Customer {
private String act_code;
private int actno;
private double avg_value;
private String pay_term;
private String pay_term_prob;
private int cluster;
private double min_value;
private double max_value;
public String getact_code() {
return act_code;
}
public void setact_code(String act_code) {
this.act_code = act_code;
}
public int getactno() {
return actno;
}
public void setactno(int actno) {
this.actno = actno;
}
public double getavg_value() {
return avg_value;
}
public void setavg_value(double avg_value) {
this.avg_value = avg_value;
}
public String getpay_term() {
return pay_term;
}
public void setpay_term(String pay_term) {
this.pay_term = pay_term;
}
public String getpay_term_prob() {
return pay_term_prob;
}
public void setpay_term_prob(String pay_term_prob) {
this.pay_term_prob = pay_term_prob;
}
public int getcluster() {
return cluster;
}
public void setcluster(int cluster) {
this.cluster = cluster;
}
public double getmin_value() {
return min_value;
}
public void setmin_value(double min_value) {
this.min_value = min_value;
}
public double getmax_value() {
return max_value;
}
public void setmax_value(double max_value) {
this.max_value = max_value;
}
}
Created collection object like this:
Please provide a reference for loading this database table into a collection as the next step. After that, I want to query the collection based on a condition.
You are on the right track here!
Every time you access the database to read data there is a computational overhead. So the best option is to access the database only once, at the start of the model. Create all the objects you need, store other data you will need later into Java classes, and then use the Java classes.
My suggestion is to create a Java class for each row in your table, like you have done. And then create a map object - like you have done, but with the key as String and the value as this new object.
Then on model start you can populate this map as follows:
List<Tuple> rows = selectFrom(customer).list();
for (Tuple row : rows) {
Customer customerData = new Customer(
row.get( customer.act_code ),
row.get( customer.actno ),
row.get( customer.avg_value )
);
mapOfCustomerData.put(customerData.getact_code(), customerData);
}
Here mapOfCustomerData is a LinkedHashMap and customer is the name of the table.
See the model created in this blog post for more details and an example of using a scenario object to store all the data from the database in a separate object.
Note: The code above is just an example. Read this blog post for more details on using the AnyLogic internal database.
Before using Java classes, try this first: click the "index" tickbox for all columns that you query with a WHERE clause.
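The load-once, look-up-many idea can be sketched in plain Java without any AnyLogic types. Everything here is hypothetical: CustomerRow stands in for one row of the table, only three of the columns are shown, and the hard-coded list stands in for the result of a single database query at model start.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CustomerCacheSketch {

    /** Hypothetical stand-in for one row of the account_details table. */
    public static final class CustomerRow {
        public final String actCode;
        public final double avgValue;
        public final int cluster;

        public CustomerRow(String actCode, double avgValue, int cluster) {
            this.actCode = actCode;
            this.avgValue = avgValue;
            this.cluster = cluster;
        }
    }

    /** Indexes the rows by act_code; meant to be called once, e.g. on model startup. */
    public static Map<String, CustomerRow> buildCache(List<CustomerRow> rows) {
        Map<String, CustomerRow> cache = new LinkedHashMap<>();
        for (CustomerRow row : rows) {
            cache.put(row.actCode, row);
        }
        return cache;
    }

    public static void main(String[] args) {
        // In the model these rows would come from a single database query.
        List<CustomerRow> rows = Arrays.asList(
            new CustomerRow("A1", 120.5, 2),
            new CustomerRow("B7", 80.0, 1));
        Map<String, CustomerRow> cache = buildCache(rows);

        // One in-memory lookup replaces the separate per-column queries for each agent.
        CustomerRow row = cache.get("B7");
        System.out.println(row.avgValue + " " + row.cluster);
    }
}
```

Each agent then pays for one hash lookup instead of several database round trips, at the cost of holding the whole table in memory.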
I'm trying to join two streams: one built from a collection of data, and one consumed from Kafka.
code snippet
public static void main(String[] args) {
KafkaSource<JsonNode> kafkaSource = ...
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Kafka messages : {"name": "John"}
final DataStream<JsonNode> dataStream1 = env.fromSource(kafkaSource, waterMark(), "Kafka").rebalance()
.assignTimestampsAndWatermarks(waterMark());
final DataStream<String> dataStream2 = env.fromElements("John", "Zbe", "Abe")
.assignTimestampsAndWatermarks(waterMark());
dataStream1
.join(dataStream2)
.where(new KeySelector<JsonNode, String>() {
@Override
public String getKey(JsonNode value) throws Exception {
return value.get("name").asText();
}
})
.equalTo(new KeySelector<String, String>() {
@Override
public String getKey(String value) throws Exception {
return value;
}
})
.window(SlidingEventTimeWindows.of(Time.minutes(50) /* size */, Time.minutes(10) /* slide */))
.apply(new JoinFunction<JsonNode, String, String>() {
@Override
public String join(JsonNode first, String second) throws Exception {
return first+" "+second;
}
}).print();
env.execute();
}
watermark
private static <T> WatermarkStrategy<T> waterMark() {
return new WatermarkStrategy<T>() {
@Override
public WatermarkGenerator<T> createWatermarkGenerator(
org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier.Context context) {
return new AscendingTimestampsWatermarks<>();
}
@Override
public TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return (event, timestamp) -> System.currentTimeMillis();
}
};
}
After running this snippet, the output contains no joined data. Am I going wrong somewhere?
Apache flink version: 1.13.2
The problem is probably related to watermarking. Since you're not using event-time-based timestamps, try changing SlidingEventTimeWindows to SlidingProcessingTimeWindows and see if it then produces results.
The underlying problem is probably a lack of data. The rebalance() on the Kafka stream guarantees that idle partitions won't stall the watermarks unless all partitions are idle. But if this is an unbounded streaming job, unless you have some data that falls after the first window, the watermark won't advance far enough to trigger the first window.
Options:
Send some data with larger timestamps
Configure the Kafka source as a bounded stream by using the .setBounded(...) option on the KafkaSource builder
Stop the job using the --drain option (docs)
The fact that dataStream2 is bounded is also a problem, but I'm not sure how much of one. At best this will prevent any windows after the first one from producing any results (since datastream joins are inner joins).
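To see why the watermark has to advance before anything is emitted, here is a minimal sketch in plain Java (no Flink classes; the names are hypothetical) of sliding-window assignment and the usual event-time firing condition. It assumes windows aligned to the epoch with no offset, and that a window can fire once the watermark reaches the last timestamp inside it (end - 1).

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    /** Start timestamps of all sliding windows [start, start + size) that contain ts. */
    public static List<Long> windowStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, slide);
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    /** True once the watermark has reached the last timestamp inside the window. */
    public static boolean canFire(long windowStart, long size, long watermark) {
        return watermark >= windowStart + size - 1;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long size = 50 * minute;
        long slide = 10 * minute;
        long ts = 125 * minute; // an event stamped at minute 125

        List<Long> starts = windowStarts(ts, size, slide);
        long earliest = starts.get(starts.size() - 1);
        System.out.println("windows containing the event: " + starts.size());
        System.out.println("fires at watermark = event time? " + canFire(earliest, size, ts));
        System.out.println("fires at minute 130? " + canFire(earliest, size, 130 * minute));
    }
}
```

With a 10-minute slide, the earliest window containing an event ends at the next 10-minute boundary, so with wall-clock timestamps nothing can be emitted until later data (or the end of a bounded input) pushes the watermark past that boundary.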
I have a stream processing application built with Spring Cloud Stream and Kafka Streams. The system takes logs from an application, compares them to observations made by another stream processor, and produces a score. The log stream is then split by the score (above and below some threshold).
The topology:
The issue:
My problem is how to properly implement the "Log best observation selector processor".
There is a finite number of observations at the moment the log is processed, but there may be a lot of them.
So I came up with 2 solutions:
Group and window the log-scored-observations topic by log id and then reduce to get the highest score. (Problem: scoring all observations may take longer than the window.)
Emit a "scoring completed" message after every scoring, join it with log-relevant-observations, and use a log-scored-observations global table with interactive queries to check that every observation id is in the global table store. When all ids are in the store, map to the observation with the highest score. (Problem: a global table does not appear to work when it is only used for interactive queries.)
What would be the best way to achieve this?
I'm hoping not to create any partition, disk or memory bottleneck.
Everything has unique ids and tuples of relevant ids when the value is joined from log & observation.
(Edit: Switched text description of topology with a diagram & change title)
Solution #2 seems to work, but it emitted warnings because interactive queries take some time to become ready, so I implemented the same solution with a Transformer:
@Slf4j
@Configuration
@RequiredArgsConstructor
@SuppressWarnings("unchecked")
public class LogBestObservationsSelectorProcessorConfig {
private String logScoredObservationsStore = "log-scored-observations-store";
private final Serde<LogEntryRelevantObservationIdTuple> logEntryRelevantObservationIdTupleSerde;
private final Serde<LogRelevantObservationIdsTuple> logRelevantObservationIdsTupleSerde;
private final Serde<LogEntryObservationMatchTuple> logEntryObservationMatchTupleSerde;
private final Serde<LogEntryObservationMatchIdsRelevantObservationsTuple> logEntryObservationMatchIdsRelevantObservationsTupleSerde;
@Bean
public Function<
GlobalKTable<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple>,
Function<
KStream<LogEntryRelevantObservationIdTuple, LogEntryRelevantObservationIdTuple>,
Function<
KTable<String, LogRelevantObservationIdsTuple>,
KStream<String, LogEntryObservationMatchTuple>
>
>
>
logBestObservationSelectorProcessor() {
return (GlobalKTable<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple> logScoredObservationsTable) ->
(KStream<LogEntryRelevantObservationIdTuple, LogEntryRelevantObservationIdTuple> logScoredObservationProcessedStream) ->
(KTable<String, LogRelevantObservationIdsTuple> logRelevantObservationIdsTable) -> {
return logScoredObservationProcessedStream
.selectKey((k, v) -> k.getLogId())
.leftJoin(
logRelevantObservationIdsTable,
LogEntryObservationMatchIdsRelevantObservationsTuple::new,
Joined.with(
Serdes.String(),
logEntryRelevantObservationIdTupleSerde,
logRelevantObservationIdsTupleSerde
)
)
.transform(() -> new LogEntryObservationMatchTransformer(logScoredObservationsStore))
.groupByKey(
Grouped.with(
Serdes.String(),
logEntryObservationMatchTupleSerde
)
)
.reduce(
(match1, match2) -> Double.compare(match1.getScore(), match2.getScore()) >= 0 ? match1 : match2,
Materialized.with(
Serdes.String(),
logEntryObservationMatchTupleSerde
)
)
.toStream()
;
};
}
@RequiredArgsConstructor
private static class LogEntryObservationMatchTransformer implements Transformer<String, LogEntryObservationMatchIdsRelevantObservationsTuple, KeyValue<String, LogEntryObservationMatchTuple>> {
private final String stateStoreName;
private ProcessorContext context;
private TimestampedKeyValueStore<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple> kvStore;
@Override
public void init(ProcessorContext context) {
this.context = context;
this.kvStore = (TimestampedKeyValueStore<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple>) context.getStateStore(stateStoreName);
}
@Override
public KeyValue<String, LogEntryObservationMatchTuple> transform(String logId, LogEntryObservationMatchIdsRelevantObservationsTuple value) {
val observationIds = value.getLogEntryRelevantObservationsTuple().getRelevantObservations().getObservationIds();
val allObservationsProcessed = observationIds.stream()
.allMatch((observationId) -> {
val key = LogEntryRelevantObservationIdTuple.newBuilder()
.setLogId(logId)
.setRelevantObservationId(observationId)
.build();
return kvStore.get(key) != null;
});
if (!allObservationsProcessed) {
return null;
}
val observationId = value.getLogEntryRelevantObservationIdTuple().getObservationId();
val key = LogEntryRelevantObservationIdTuple.newBuilder()
.setLogId(logId)
.setRelevantObservationId(observationId)
.build();
ValueAndTimestamp<LogEntryObservationMatchTuple> observationMatchValueAndTimestamp = kvStore.get(key);
return new KeyValue<>(logId, observationMatchValueAndTimestamp.value());
}
@Override
public void close() {
}
}
}
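Stripped of the Kafka Streams plumbing, the core of this approach is "emit nothing until every expected observation has a score, then pick the maximum". Here is a minimal sketch of that logic in plain Java. The names are hypothetical, and a plain Map stands in for the state store.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class BestObservationSketch {

    /**
     * Returns the id of the best-scoring observation once every expected id has a score;
     * returns empty while any expected observation is still unscored.
     */
    public static Optional<String> bestIfComplete(Set<String> expectedIds, Map<String, Double> scoreById) {
        if (!scoreById.keySet().containsAll(expectedIds)) {
            return Optional.empty();
        }
        return expectedIds.stream()
            .max(Comparator.comparingDouble((String id) -> scoreById.get(id)));
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<>(Arrays.asList("o1", "o2", "o3"));
        Map<String, Double> scores = new HashMap<>();
        scores.put("o1", 0.4);
        scores.put("o2", 0.9);
        System.out.println(bestIfComplete(expected, scores).isPresent()); // o3 not scored yet

        scores.put("o3", 0.7);
        System.out.println(bestIfComplete(expected, scores).orElse("none"));
    }
}
```

The check is repeated for every "scoring completed" message, so its cost grows with the number of relevant observations per log; that is worth keeping in mind given the concern about bottlenecks.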
I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to alter the pipeline so that it is reading multiple topics and writing them out as gs://bucket/topic/...
When reading only a single topic I used TextIO in the last step of my pipeline:
TextIO.write()
.to(
new DateNamedFiles(
String.format("gs://bucket/data%s/", suffix), currentMillisString))
.withWindowedWrites()
.withTempDirectory(
FileBasedSink.convertToFileResourceIfPossible(
String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
.withNumShards(1));
This is a similar question, whose code I tried to adapt:
FileIO.<EventType, Event>writeDynamic()
.by(
new SerializableFunction<Event, EventType>() {
@Override
public EventType apply(Event input) {
return EventType.TRANSFER; // should return real type here, just a dummy
}
})
.via(
Contextful.fn(
new SerializableFunction<Event, String>() {
@Override
public String apply(Event input) {
return "Dummy"; // should return the Event converted to a String
}
}),
TextIO.sink())
.to(DynamicFileDestinations.constant(new DateNamedFiles("gs://bucket/tmp%s/%s/",
currentMillisString),
new SerializableFunction<String, String>() {
@Override
public String apply(String input) {
return null; // Not sure what this should exactly, but it needs to
// include the EventType into the path
}
}))
.withTempDirectory(
FileBasedSink.convertToFileResourceIfPossible(
String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
.withNumShards(1))
The official JavaDoc contains example code which seems to have outdated method signatures (the .via method seems to have switched the order of its arguments). I furthermore stumbled across the example in FileIO, which confused me: shouldn't TransactionType and Transaction in this line change places?
After a night of sleep and a fresh start I figured out the solution. I used the functional Java 8 style, as it makes the code shorter (and more readable):
.apply(
FileIO.<String, Event>writeDynamic()
.by((SerializableFunction<Event, String>) input -> input.getTopic())
.via(
Contextful.fn(
(SerializableFunction<Event, String>) input -> input.getPayload()),
TextIO.sink())
.to(String.format("gs://bucket/data%s/", suffix))
.withNaming(type -> FileNaming.getNaming(type, "", currentMillisString))
.withDestinationCoder(StringUtf8Coder.of())
.withTempDirectory(
String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString))
.withNumShards(1));
Explanation:
Event is a Java POJO containing the payload of the Kafka message and the topic it belongs to; it is parsed in a ParDo after the KafkaIO step.
suffix is either dev or empty and is set by environment variables.
currentMillisString contains the timestamp at which the whole pipeline was launched, so that new files don't overwrite old files on GCS when a pipeline gets restarted.
FileNaming implements a custom naming scheme and receives the type of the event (the topic) in its constructor; it uses a custom formatter to write to daily partitioned "sub-folders" on GCS:
class FileNaming implements FileIO.Write.FileNaming {
static FileNaming getNaming(String topic, String suffix, String currentMillisString) {
return new FileNaming(topic, suffix, currentMillisString);
}
private static final DateTimeFormatter FORMATTER = DateTimeFormat
.forPattern("yyyy-MM-dd").withZone(DateTimeZone.forTimeZone(TimeZone.getTimeZone("Europe/Zurich")));
private final String topic;
private final String suffix;
private final String currentMillisString;
private String filenamePrefixForWindow(IntervalWindow window) {
return String.format(
"%s/%s/%s_", topic, FORMATTER.print(window.start()), currentMillisString);
}
private FileNaming(String topic, String suffix, String currentMillisString) {
this.topic = topic;
this.suffix = suffix;
this.currentMillisString = currentMillisString;
}
@Override
public String getFilename(
BoundedWindow window,
PaneInfo pane,
int numShards,
int shardIndex,
Compression compression) {
IntervalWindow intervalWindow = (IntervalWindow) window;
String filenamePrefix = filenamePrefixForWindow(intervalWindow);
String filename =
String.format(
"pane-%d-%s-%05d-of-%05d%s",
pane.getIndex(),
pane.getTiming().toString().toLowerCase(),
shardIndex,
numShards,
suffix);
String fullName = filenamePrefix + filename;
return fullName;
}
}
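To show what kind of relative path this naming scheme produces, here is a minimal sketch in plain Java. It uses java.time instead of the Joda-Time formatter above, takes the pane index and timing as plain arguments rather than a PaneInfo, and all sample values (topic, run id, timestamps) are made up for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class FileNameSketch {

    // Day partition in the same time zone as the FORMATTER in the answer.
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneId.of("Europe/Zurich"));

    /** Builds "<topic>/<day>/<runId>_pane-<index>-<timing>-<shard>-of-<shards><suffix>". */
    public static String fileName(String topic, Instant windowStart, String runId,
                                  long paneIndex, String timing,
                                  int shardIndex, int numShards, String suffix) {
        String prefix = String.format("%s/%s/%s_", topic, DAY.format(windowStart), runId);
        String name = String.format("pane-%d-%s-%05d-of-%05d%s",
            paneIndex, timing, shardIndex, numShards, suffix);
        return prefix + name;
    }

    public static void main(String[] args) {
        // 22:30 UTC is already the next calendar day in Zurich during summer time.
        Instant windowStart = Instant.parse("2018-09-01T22:30:00Z");
        System.out.println(
            fileName("orders", windowStart, "1535840000000", 0, "on_time", 0, 1, ""));
    }
}
```

Because the day is formatted in a fixed zone, a window that starts late in the evening UTC lands in the next day's "sub-folder", which may or may not be what downstream consumers expect.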
I need help understanding the win:ext_timed window in Esper (CEP). I'm wondering why the older (first 2) events still show up in the update method even though they have been "expired".
public class MyCepTest {
public static void main(String...args) throws Exception{
System.out.println("starting");
MyCepTest ceptest = new MyCepTest();
ceptest.execute();
System.out.println("end");
}
public void execute() throws Exception{
Configuration config = new Configuration();
config.addEventType(MyPojo.class);
EPServiceProvider epService = EPServiceProviderManager.getDefaultProvider(config);
EPAdministrator admin = epService.getEPAdministrator();
EPStatement x1 = admin.createEPL(win);
EPStatement x2 = admin.createEPL(win2);
x1.setSubscriber(this);
x2.setSubscriber(this);
EPRuntime runtime = epService.getEPRuntime();
ArrayList<MyPojo> staffToSendToCep = new ArrayList<MyPojo>();
staffToSendToCep.add(new MyPojo(1, new Date(1490615719497L)));
staffToSendToCep.add(new MyPojo(2, new Date(1490615929497L)));
for(MyPojo pojo : staffToSendToCep){
runtime.sendEvent(pojo);
}
Thread.sleep(500);
System.out.println("round 2...");//why two first Pojos are still found? Shouldn't ext_timed(pojoTime.time, 300 seconds) rule them out?
staffToSendToCep.add(new MyPojo(3, new Date(1490616949497L)));
for(MyPojo pojo : staffToSendToCep){
runtime.sendEvent(pojo);
}
}
public void update(Map<String,Object> map){
System.out.println(map);
}
public static String win = "create window fiveMinuteStuff.win:ext_timed(pojoTime.time, 300 seconds)(pojoId int, pojoTime java.util.Date)";
public static String win2 = "insert into fiveMinuteStuff select pojoId,pojoTime from MyPojo";
}
class MyPojo{
int pojoId;
Date pojoTime;
MyPojo(int pojoId, Date date){
this.pojoId = pojoId;
this.pojoTime = date;
}
public int getPojoId(){
return pojoId;
}
public Date getPojoTime(){
return pojoTime;
}
public String toString(){
return pojoId+"#"+pojoTime;
}
}
I've been puzzled by this for a while, and help would be greatly appreciated.
See the processing model in docs. http://espertech.com/esper/release-6.0.1/esper-reference/html/processingmodel.html
All incoming insert-stream events are delivered to listeners and subscribers, regardless of your window. A window, if one is in the query at all, defines the subset of events to consider and therefore defines what gets aggregated, pattern-matched or is available for iteration. Try "select * from MyPojo" for reference. My advice is to read up on external time; see http://espertech.com/esper/release-6.0.1/esper-reference/html/api.html#api-controlling-time
Usually, when you want an "external time window", you want event time to drive engine time.
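The distinction between "what is delivered" and "what the window retains" can be illustrated with a simplified model of an externally timed window in plain Java. This is not Esper code: the class is hypothetical, it assumes timestamps arrive in non-decreasing order, and the exact boundary condition Esper uses for expiry may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ExtTimedWindowSketch {

    private final long sizeMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public ExtTimedWindowSketch(long sizeMillis) {
        this.sizeMillis = sizeMillis;
    }

    /** Adds an event timestamp, then expires entries that are sizeMillis or more older than it. */
    public void add(long ts) {
        timestamps.addLast(ts);
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= ts - sizeMillis) {
            timestamps.removeFirst();
        }
    }

    /** Snapshot of what the window currently retains, oldest first. */
    public List<Long> contents() {
        return new ArrayList<>(timestamps);
    }

    public static void main(String[] args) {
        ExtTimedWindowSketch window = new ExtTimedWindowSketch(300_000L); // 300 seconds
        window.add(1490615719497L);
        window.add(1490615929497L); // 210 s after the first: both retained
        System.out.println(window.contents().size());
        window.add(1490616949497L); // 1020 s later: the first two expire
        System.out.println(window.contents().size());
    }
}
```

Every call to add corresponds to an event that a subscriber would still see on arrival; expiry only changes what the window holds for later aggregation or iteration.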