What is the relationship between @timestamp and "over window:time"? - drools

I'm using Drools 6.2.0.Final and I need to process a set of events using window:time. Each event has a date field.
public class Event {
private Long id;
private Date date;
...
And in my drl:
declare org.drools.examples.broker.events.Event
@role( event )
@timestamp( date )
end
rule "test"
when
$count: Number() from accumulate (
$e: Event() over window:time(40s) from entry-point "stream" ,
count($e))
then
System.out.println("Count:" + $count);
end
e1 (2015-01-01 00:00:00)
e2 (2015-01-01 00:00:20)
e3 (2015-01-01 00:00:40)
e4 (2015-01-01 00:01:00)
Scenario 1: Using the realtime clock and inserting the set of events simultaneously.
session.getEntryPoint("stream").insert(e1);
session.fireAllRules();
session.getEntryPoint("stream").insert(e2);
session.fireAllRules();
session.getEntryPoint("stream").insert(e3);
session.fireAllRules();
session.getEntryPoint("stream").insert(e4);
session.fireAllRules();
Scenario 2: Using the pseudo clock, inserting the set of events simultaneously, and advancing the clock by each event's offset.
session.getEntryPoint("stream").insert(e1);
session.fireAllRules();
clock.advanceTime(20, TimeUnit.SECONDS);
session.getEntryPoint("stream").insert(e2);
session.fireAllRules();
clock.advanceTime(40, TimeUnit.SECONDS);
session.getEntryPoint("stream").insert(e3);
session.fireAllRules();
clock.advanceTime(60, TimeUnit.SECONDS);
session.getEntryPoint("stream").insert(e4);
session.fireAllRules();
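One detail worth checking in Scenario 2: advanceTime adds the given amount to the pseudo clock's current time; it does not set an absolute time. The toy model below (plain Python, not the Drools API; PseudoClock and clock_readings are hypothetical names) shows what the clock reads after each call.

```python
class PseudoClock:
    """Toy stand-in for a manually driven clock (hypothetical, not the Drools class)."""

    def __init__(self):
        self.now = 0  # seconds since the session started

    def advance_time(self, seconds):
        # Adds a delta to the current time; it does not jump to an absolute time.
        self.now += seconds
        return self.now


def clock_readings(steps):
    """Return the clock value after each advance_time call."""
    clock = PseudoClock()
    return [clock.advance_time(step) for step in steps]


cumulative_offsets = clock_readings([20, 40, 60])  # the amounts used in Scenario 2
fixed_deltas = clock_readings([20, 20, 20])        # 20 s between consecutive events

print(cumulative_offsets)  # [20, 60, 120]
print(fixed_deltas)        # [20, 40, 60]
```

With the amounts from Scenario 2 the clock ends up at 20 s, 60 s and 120 s, whereas the event dates are 20 s apart.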
The second scenario runs fine. But I have some questions:
What is the relationship between @timestamp and "over window:time"?
What happens if I need to insert unsorted events (by timestamp) into working memory?
Can I use the timestamp denoted by my event instead of the timestamp denoted by the insert time?
Thanks.
UPDATE 1
@timestamp, @duration, etc. are only used to relate events to each other (e.g. A before B, A meets B, and so on); they do not relate the event to the clock. "over window:time", however, is based on Drools's clock: the time window uses the moment the event was inserted into working memory to match the rule. You need to use Drools stream mode.

What is the relationship between @timestamp and "over window:time"? A window of length d selects an event with timestamp x if now - d < x <= now.
What happens if I need to insert unsorted events (by timestamp) into working memory? You shouldn't do that unless the engine is in "cloud" mode. The timestamp is basically nothing but a long value and can be evaluated all the same. However, think carefully before doing this, as it may produce results that differ from an execution where the inserts are done in the proper order.
Can I use the timestamp denoted by my event instead of the timestamp denoted by the insert time? It appears that you are doing just that, thanks to @timestamp(date) in the DRL declare statement.
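The window condition now - d < x <= now can be checked by hand against the four events from the question. The following is only a sketch in plain Python (not Drools code); in_window and window_contents are hypothetical helper names.

```python
from datetime import datetime, timedelta


def in_window(now, length, timestamp):
    """True if the timestamp lies in the half-open interval (now - length, now]."""
    return now - length < timestamp <= now


base = datetime(2015, 1, 1, 0, 0, 0)
events = {
    "e1": base,
    "e2": base + timedelta(seconds=20),
    "e3": base + timedelta(seconds=40),
    "e4": base + timedelta(seconds=60),
}
window = timedelta(seconds=40)


def window_contents(now):
    """Names of the events a 40 s window would select at the given clock time."""
    return [name for name, ts in events.items() if in_window(now, window, ts)]


at_40s = window_contents(base + timedelta(seconds=40))
at_60s = window_contents(base + timedelta(seconds=60))
print(at_40s)  # ['e2', 'e3']
print(at_60s)  # ['e3', 'e4']
```

Because the lower bound is exclusive, an event that is exactly 40 s old has already left the window.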

Index for TIMESTAMPTZ and function immutability

We have a structure similar to the following:
create table company
(
id bigint not null,
tz text not null
);
create table company_data
(
company_id bigint not null,
ts_tz timestamp with time zone not null
);
The tables are simplified.
Fiddle with sample data here: SQL Fiddle
Every company has a fixed TZ. So, when we need to extract some information from company_data we use a query similar to the following:
select
cd.company_id,
cd.ts_tz at time zone c.tz
from company_data cd
join company c on c.id = cd.company_id;
We also have a function to get company tz:
create or replace function tz_company(f_company_id bigint) returns text
language plpgsql
as
$$
declare
f_tz text;
begin
select c.tz from company c where c.id = f_company_id into f_tz;
return f_tz;
end;
$$;
And another to transform a ts in a date applying a tz:
create or replace function tz_date(timestamp with time zone, text) returns date
language plpgsql
immutable strict
as
$$
begin
return ($1 at time zone $2) :: date;
end;
$$;
The problem we have now is that company_data (and other similar tables) is a large and frequently used table. Most of the SELECTs on that table filter by a DATE.
For example:
select cd.company_id,
cd.ts_tz at time zone tz_company(cd.company_id)
from company_data cd
where tz_date(cd.ts_tz, tz_company(cd.company_id)) >= '2019-08-20'
and tz_date(cd.ts_tz, tz_company(cd.company_id)) <= '2019-08-22';
So, to speed up queries, we need to add an index on the company_data.ts_tz column. The only way we found to do this was the following:
create index idx_company_data_ts_tz on company_data
(((company_data.ts_tz at time zone tz_company(company_data.company_id))::date));
For this to work, we need to make the tz_company function immutable.
Some other problems (and ideas) emerged:
1 - The version of the query that uses the tz_date function does not use the index.
Does not use the index:
explain analyse
select cd.company_id,
cd.ts_tz at time zone tz_company(cd.company_id)
from company_data cd
where tz_date(cd.ts_tz, tz_company(cd.company_id)) >= '2019-08-20'
and tz_date(cd.ts_tz, tz_company(cd.company_id)) <= '2019-08-22';
Uses index:
explain analyse
select cd.company_id,
cd.ts_tz at time zone tz_company(cd.company_id)
from company_data cd
where (cd.ts_tz at time zone tz_company(cd.company_id))::date >= '2019-08-20'
and (cd.ts_tz at time zone tz_company(cd.company_id))::date <= '2019-08-22';
Why does that happen?
2 - We know that, in theory, tz_company should not be immutable, at most stable. But a company's tz is information that should never change. Yes, it could happen, but it is improbable: in the past three years, we have never changed the tz of any company. So, is it still a problem for tz_company to be immutable? If it is, how could we rewrite the index? Note that a single SELECT could return information for more than one company and mix different time zones.
3 - Because of the complexity of dealing with indexes on a timestamptz column, we are considering adding another column to every table that has a ts_tz. This new column would be a date with the tz already applied. Is this a good approach?
Besides, we need to apply the tz before casting because every client (company) filters by dates only, and these dates are locale-aware (tz-aware).
EDIT 1:
The queries used are only for demonstration. But a requirement is that the client sees the timestamps in the timezone where the event has occurred, this is an important requirement. We deal with logistics operations in Brazil and Brazil itself has four different timezones across the country.
A holding could own different companies and every company could be in a different timezone.
So, a lot of queries deal with different companies in different time zones and apply some date filtering. Today, our backend returns all data ready to display, with the time zone applied, and this would be difficult to change.
What we want to achieve is an easy and performant way of dealing with those timestamptz columns: filtering by date (tz-aware) and using indexes to speed up queries.
1 - That's because tz_date is not marked as immutable. It is safe to mark it as immutable if Postgres lets you create an index on the same expression as the one in the function body; it only allows that for an immutable expression. Some Postgres date-time functions and type casts are immutable, some aren't. BTW, I'm not sure what happens to an index if the at time zone operator breaks its immutability contract when tzdata changes; that happens quite often on a Postgres or OS upgrade, depending on the settings.
2 - That's a very dangerous approach: the index becomes corrupted if you change the data. You may lose data. If you absolutely need this pseudo-immutable function, I would strongly recommend adding a trigger that disallows deletes, truncates and updates of company.tz. If you ever need to change the time zone data, drop the index first.
3 - The key question is whether you happen to query data across multiple companies?
a) If you do, it only makes numerological sense. 2019-09-13 events from Niue (UTC-11) and 2019-09-13 events from New Zealand (UTC+13) can never happen at the same time. The only common property of these events is that they happened on Friday the 13th. That's only notation; it never was 2019-09-13 in both countries at the same time. So please make sure your queries really make sense. In this unlikely case, denormalizing the date notation into a separate timestamp without time zone column would make sense, as you're filtering by the notation of time, not by the moment in time. I would recommend a trigger to populate it.
b) All your queries are single-company. In this case I would create a plain index on the columns only, with no expressions, create a function, and write queries like this:
create index on company_data(company_id, ts_tz);
create function midnight_at_company(p_date date, p_company_id bigint) returns timestamp with time zone
stable strict -- stable: only reads company, so the result can serve as an index bound
as $$
select p_date::timestamp at time zone tz from company where id = p_company_id;
$$ language sql;
-- put your company id instead of $1
explain analyse
select cd.company_id,
cd.ts_tz at time zone tz_company(cd.company_id)
from company_data cd
where company_id = $1
and cd.ts_tz >= midnight_at_company('2019-08-20', $1)
and cd.ts_tz < midnight_at_company('2019-08-23', $1); --note exact `<`, not `<=`
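The reason for the half-open range (>= on the first midnight, < on the midnight after the last day) can be seen by computing the bounds by hand. This is a minimal sketch in Python rather than SQL; midnight_at and utc_bounds are hypothetical helpers, and a fixed UTC-3 offset stands in for the company's zone (an assumption made to keep the example dependency-free; real code would use a named zone).

```python
from datetime import date, datetime, time, timedelta, timezone


def midnight_at(day, tz):
    """Instant at which `day` starts in zone `tz`, expressed in UTC."""
    return datetime.combine(day, time.min, tzinfo=tz).astimezone(timezone.utc)


def utc_bounds(first_day, last_day, tz):
    """Half-open [start, end) range covering first_day..last_day inclusive."""
    return midnight_at(first_day, tz), midnight_at(last_day + timedelta(days=1), tz)


company_tz = timezone(timedelta(hours=-3))  # assumption: fixed offset for the company
start, end = utc_bounds(date(2019, 8, 20), date(2019, 8, 22), company_tz)


def matches(ts):
    """Same comparison as the WHERE clause: ts >= start and ts < end."""
    return start <= ts < end


print(start.isoformat())  # 2019-08-20T03:00:00+00:00
print(end.isoformat())    # 2019-08-23T03:00:00+00:00
```

The last second of 2019-08-22 local time is still inside the range, while the first instant of 2019-08-23 is not, which a <= comparison against a date would not express.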
I would standardize all the time zones into one, calling it database or server time. I understand that the companies are in different places, but that is not a good reason to have time zones all over your data. Using this method eliminates the need for a time zone reference table. When you pull the data for any one of these companies, write your code to take the server time zone into account so that it reads in your local time.
This will eliminate tons of potential confusion. This method is used across the world; that is why timestamps in most APIs use only one time zone.
In response to Edit:
Hi @Luiz
Let me start by saying there is no right or wrong answer; it's whatever you think works best. In my case, I am of the opinion that the front-end view and the data should be managed somewhat separately. On the data side, as per this topic, I would handle all date stamps using server time. The need to view data one way or another is a front-end issue.
For your requirement, I would either hard-code a JS switch like this:
function timeZoneFor(company) {
  switch (company) {
    case "CompanyA":
      return "EST"; // time zone code for CompanyA
    case "CompanyB":
      // code block: return CompanyB's time zone code
      break;
    default:
      // code block: fall back to a default time zone
  }
}
or, if there are too many companies to hard-code, I would make a table with the "Company ID", "Company name", and "Time Zone Code". Do not link this table to your data. You should add the "Company ID" to the main table with events that have the server time zone.
Use the table with the company time zone codes to populate the lookup filter that will be used to run your query. When your script's event handler reacts to the drop-down menu, it will save the current time zone code associated with that company and use that value when displaying times in accordance with your requirement. I would also make your code load data asynchronously (1000 records or so every few milliseconds) instead of all at once. This will vastly improve performance, and the user will not be able to tell that their data is still loading.
These efforts will let you manipulate the time zone to meet current and future requirements that might come up.
I think the current schema you are using for your application is not the best for such a problem.
You will have a lot of problems saving different time zones in the same table.
Use UTC, and only UTC, at the DB/schema level; you can also set that in the Postgres configuration.
Depending on the application, you could send back UTC dates and convert them to the user's local time in JavaScript or on the server side. If that's not possible, have one place where the user specifies their current UTC offset, and convert the date/time to their time right before you display it.
This is going to make your life super simple, and you can achieve great performance at the query level because you would now have a performant DB schema. The SQL functions you have make no sense, as you can achieve much better performance just by using indexes in the DB.
So, for your specific requirements, I would keep the schema you have with some additions: I would index the id of the company table and store all timestamps in company_data in UTC.
When company data is requested, we fetch the time zone (text) from the company table; using this, the backend code/JS can do the time zone conversion magic.
We have a limited number of time zones; ideally you can have those set in config to make the lookup easier and faster.

Count and Time window in Esper EPL

I have the following use case, which I'm trying to write in EPL, without success. I'm generating analytics events of different types, generated at different intervals (1min, 5min, 10min, ...). For a special kind of analytics, I need to collect 4 specific analytics events (from which I will compute another analytics event) of different types, returned every interval (1min, 5min, 10min, ...). The condition is that on every whole interval, e.g. every whole minute 00:01:00, 00:02:00, I want to get back either all 4 events, or nothing if the events don't arrive within some slack period afterwards (e.g. 2s).
case 1: events A, B, C, D arrive at 00:01:00.500, 00:01:00.600, 00:01:00.700, 00:01:00.800 - right after the fourth event arrives at Esper, the aggregated event with all 4 events is returned
case 2: the slack period is 2 seconds, events A, B, C, D arrive at 00:01:00.500, 00:01:00.600, 00:01:00.700, 00:01:02.200 - nothing is returned, as the last event is outside the slack period
You could create a trigger event every minute like this:
insert into TriggerEvent select * from pattern[timer:schedule(date:'1970-01-01T00:00:00.0Z', period: 1 minute, repetitions: -1)]
The trigger that arrives every minute can kick off a pattern or context. A pattern would seem to be good enough. Here is something like that:
select * from pattern [every TriggerEvent -> (a=A -> b=B -> c=C -> d=D) where timer:within(2 seconds)]
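To make the intended behaviour of that pattern concrete, here is a rough simulation in plain Python (not EPL, and not how Esper evaluates patterns internally). The collect function and its parameters are hypothetical, and treating an arrival exactly at the 2 second boundary as too late is a choice of this sketch. It replays case 1 and case 2 from the question.

```python
from datetime import datetime, timedelta


def collect(trigger, arrivals, expected=("A", "B", "C", "D"), slack=timedelta(seconds=2)):
    """Return the matched event names if all expected events arrive, in order,
    after the trigger and before trigger + slack; otherwise return None."""
    deadline = trigger + slack
    matched = []
    for name, arrived_at in sorted(arrivals, key=lambda event: event[1]):
        if arrived_at < trigger or arrived_at >= deadline:
            continue  # outside the slack period that follows the trigger
        if name == expected[len(matched)]:
            matched.append(name)
            if len(matched) == len(expected):
                return matched
    return None


trigger = datetime(2020, 1, 1, 0, 1, 0)  # a whole-minute trigger at 00:01:00


def at(milliseconds):
    """Arrival time expressed as an offset from the trigger."""
    return trigger + timedelta(milliseconds=milliseconds)


case1 = collect(trigger, [("A", at(500)), ("B", at(600)), ("C", at(700)), ("D", at(800))])
case2 = collect(trigger, [("A", at(500)), ("B", at(600)), ("C", at(700)), ("D", at(2200))])
print(case1)  # ['A', 'B', 'C', 'D']
print(case2)  # None
```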

How to get the time difference in talend?

How can I get the time difference between a value and the previous value? For example, given
2017-01-01 13:00:00
2017-01-01 13:15:00
I need the difference, 15 minutes. How can I do that?
First, use TalendDate.diffDate(column1,column2,"pattern") to get the time difference.
Then, if you want to compare the current value with the previous one (in the same column), you can add a sequence to your flow; it will help you identify which value is the previous one. After that, you just have to read your flow twice and do an inner join between the current sequence and the current sequence - 1 to get the current date and the previous date.
First subjob :
YourFlow -> tMap -> tHashOutput
In tMap, add a new "sequence" column to your field and use Numeric.sequence("s1",1,1).
This way all lines will have an ID.
Then read your hash twice and join the flows on "sequence - 1":
tHashInput_1----|
|--tMap--->Output
tHashInput_2----|
Put the TalendDate.diffDate() call in the output, using the two date fields.
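The sequence-and-self-join idea is independent of Talend, so it can be sketched in a few lines of plain Python. This is only an illustration of the logic, not Talend code; minutes_since_previous is a hypothetical name.

```python
from datetime import datetime


def minutes_since_previous(dates):
    """Number each row, then pair every row with the row whose sequence is one lower."""
    numbered = {seq: d for seq, d in enumerate(dates, start=1)}  # the sequence column
    result = []
    for seq, current in numbered.items():
        previous = numbered.get(seq - 1)  # the join on "sequence - 1"
        if previous is not None:          # inner join: the first row has no partner
            result.append(int((current - previous).total_seconds() // 60))
    return result


rows = [datetime(2017, 1, 1, 13, 0, 0), datetime(2017, 1, 1, 13, 15, 0)]
diffs = minutes_since_previous(rows)
print(diffs)  # [15]
```

The first row drops out of the result because an inner join finds no row with sequence 0.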
Here is an alternative :
Start by recording the start time of the Talend job, like this (here in a tJava, but you can also use a tSetGlobalVar component):
globalMap.put("startDate", TalendDate.getDate("CCYY/MM/DD hh:mm:ss"));
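The following code is used later in the job, inside a tJava:
// Assumption: this Java pattern mirrors the Talend pattern "CCYY/MM/DD hh:mm:ss" used above
java.text.SimpleDateFormat format = new java.text.SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
String endDate = TalendDate.getDate("CCYY/MM/DD hh:mm:ss");
// Elapsed time in milliseconds between the stored start date and now
long executionTime = format.parse(endDate).getTime() - format.parse(((String)globalMap.get("startDate"))).getTime();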
System.out.println("Execution Time : "+(executionTime/(60*60*1000))+" Hour(s) "+(executionTime/(60*1000)%60)+" Minute(s) "+(executionTime/1000%60)+" second(s).");

PostgreSQL - Have one serial type column but reset every day (unique combination with other date type column)

I need a table in my PostgreSQL database with a serial-type column that resets every day and is unique in combination with a date-type column.
For example, today I insert 2 rows:
SerialId, Date
1, '08.12.2016'
2, '08.12.2016'
But tomorrow the next insert should get SerialId = 1 and tomorrow's date:
1, '09.12.2016' ...
The problem is that more than one user inserts into this table, so I can't keep a global variable in my application that counts and resets every day.
Any ideas?
Thanks in advance.
I would just let the sequence keep running, but if you really want to reset it, you could define a cron job that runs at midnight and issues
ALTER SEQUENCE ... RESTART;
If the application is really busy around midnight, you have a race condition there, because there is no guarantee that the sequence will be reset precisely at midnight, but if there is not much traffic at this time you might get away with it.

Postgres: using timestamps for pagination

I have a table with a created (timestamptz) column. I need pagination based on that timestamp, because while a user is viewing the first page, new items could be inserted into the table, which would make the pages inconsistent if I used OFFSET.
So the question is: should I keep created as timestamptz, or is it better to convert it to an integer (Unix time, e.g. 1472031802812)? If so, are there any disadvantages? Also, at the moment created defaults to now(); is there an alternative function that produces a Unix timestamp?
Let me move things from the comments into my answer. You want to use a timestamp type instead of an integer simply because that's exactly what it was designed for. Converting manually between integer timestamps and timestamp objects is just a pain, and you gain nothing. You will also need the timestamp type eventually for more complex datetime-based queries.
To answer a question about pagination. You simply do a query
SELECT *
FROM table_name
WHERE created < lastTimestamp
ORDER BY created DESC
LIMIT 30
For the first query, set lastTimestamp to, say, '3000-01-01'. Otherwise, set lastTimestamp = last_query.last_row.created.
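The paging loop described above can be tried end to end with an in-memory database. The sketch below uses Python's sqlite3 module as a stand-in for Postgres, with timestamps stored as ISO text so that string order equals time order; the items table and the fetch_page helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, created TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO items (created) VALUES (?)",
    [(f"2016-08-24 09:{minute:02d}:00",) for minute in range(5)],
)


def fetch_page(last_timestamp, page_size=2):
    """One page of rows strictly older than last_timestamp, newest first."""
    return conn.execute(
        "SELECT id, created FROM items WHERE created < ? "
        "ORDER BY created DESC LIMIT ?",
        (last_timestamp, page_size),
    ).fetchall()


first_page = fetch_page("3000-01-01")  # far-future sentinel for the first query

# A new row arrives while the user is still looking at the first page.
conn.execute("INSERT INTO items (created) VALUES ('2016-08-24 09:30:00')")

second_page = fetch_page(first_page[-1][1])  # continue below the last row seen

print([row[0] for row in first_page])   # [5, 4]
print([row[0] for row in second_page])  # [3, 2]
```

The row inserted in between does not shift the second page, which is the property OFFSET cannot give.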
Optimization
Note that if the table is big, ORDER BY created DESC might not be efficient (especially if called in parallel with different ranges). In that case you can use moving "time windows", for example:
SELECT *
FROM table_name
WHERE
created < lastTimestamp
AND created >= lastTimestamp - interval '1 day'
The 1 day interval is picked arbitrarily (tune it to your needs). You can also sort the results in the app.
If the result is not empty, then you update (in your app)
lastTimestamp = last_query.last_row.created
(assuming you've sorted; otherwise take min(last_query.row.created)).
If the result is empty, repeat the query with lastTimestamp = lastTimestamp - interval '1 day' until you fetch something. You also have to stop if lastTimestamp becomes too low, i.e. when it is lower than any timestamp in the table (which has to be prefetched).
All of that relies on some assumptions about inserts:
1 - new_row.created >= any_row.created
2 - new_row.created ~ current_time
3 - The distribution of new_row.created is more or less uniform
Assumption 1 ensures that pagination results in consistent data while assumption 2 is only needed for the default 3000-01-01 date. Assumption 3 is to make sure that you don't have big empty gaps when you have to issue many empty queries.
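The moving-window loop, including the stop condition, might look roughly like this. It is a sketch over an in-memory list rather than a real table; next_page is a hypothetical function, and oldest plays the role of the prefetched minimum timestamp.

```python
from datetime import datetime, timedelta


def next_page(rows, last_timestamp, oldest, window=timedelta(days=1)):
    """Step the window back until it contains rows or passes the oldest row.
    Returns (page sorted newest first, new last_timestamp)."""
    while last_timestamp > oldest:
        lower = last_timestamp - window
        # Stand-in for: WHERE created < lastTimestamp AND created >= lastTimestamp - window
        page = sorted((r for r in rows if lower <= r < last_timestamp), reverse=True)
        if page:
            return page, page[-1]
        last_timestamp = lower  # empty window: move one window further back
    return [], last_timestamp   # nothing older is left in the table


rows = [datetime(2016, 1, 10, 12), datetime(2016, 1, 10, 8), datetime(2016, 1, 7, 9)]
oldest = min(rows)  # prefetched once

page1, cursor = next_page(rows, datetime(2016, 1, 11), oldest)
page2, cursor = next_page(rows, cursor, oldest)  # skips two empty windows
page3, cursor = next_page(rows, cursor, oldest)  # reached the oldest row

print(len(page1), len(page2), len(page3))  # 2 1 0
```

The gap between 2016-01-07 and 2016-01-10 costs two empty iterations here, which is why assumption 3 matters.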
You mean something like this?
select extract(epoch from now())::integer as unix_time