How to limit memory consumption in Esper Window - complex-event-processing

I have the following query window in my application. I want to be notified when an Employee event arrives for the first time (a new Employee) OR when there is any update for an existing Employee. My application receives a lot of updates, and I suspect the window below is not keeping only the last two records in memory, which is why memory usage increases over time.
Is there any way I can ensure that only the last two records are kept in the window?
@name('stmtUpdateEmployee') select * from Employee#groupwin(empId)#length(2)
where
prev(1, age) <> age OR
prev(1, dept) <> dept OR
prev(1, address) <> address OR
prev(1, empId) is null;

I suspect the window below is not keeping only the last two records in
memory and due to that memory usage increases with time
Your suspicion is correct! There will be more than the last two records in memory ... depending on the data, a lot more.
As described in the Esper Reference (section 5.6.8), the #groupwin(empId) declaration in the from clause ...
select * from Employee#groupwin(empId)#length(2)
creates a separate data window per key value, and the query planner creates a hash index on Employee, essentially resulting in holding on to events for every distinct empId.
So memory grows with the number of unique employees ... with a length of 2 there will be up to 2 * (number of distinct empId values) events in memory, since the latest employee event and the previous employee event are retained for each empId.
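The answer above does not offer a workaround, but if memory must be bounded, one direction to explore is intersecting the grouped length window with a time window, so that events for employees that stop sending updates eventually expire. This is only a sketch I have not verified against your Esper version: the 30-minute retention is an arbitrary illustration, and you should confirm that prev() still behaves as you expect over an intersection of data windows.
select * from Employee#groupwin(empId)#length(2)#time(30 minutes)
where
prev(1, age) <> age OR
prev(1, dept) <> dept OR
prev(1, address) <> address OR
prev(1, empId) is null;
Under intersection semantics (the default when several data windows are declared), at most two events per empId are retained and none older than 30 minutes, so memory scales with the number of employees active in the last 30 minutes rather than with every employee ever seen.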

Related

How can I query postgres by page?

Assume a table that has many pages and I only want some of it, for example
SELECT * FROM t WHERE...
From pg_class.relpages I know how many pages there are; how can I access a specific page, like
SELECT * FROM t WHERE PAGE_NUMBER=1
Since you mention relpages I assume you are talking about "physical" pages where the data is stored on disk. In general you should mind your own business and let the database server mind its own business. But if you don't want to do that, then you can reference tuples by the hidden system column "ctid", which is a composite consisting of the page number and the slot within the page.
select * from pgbench_accounts where ctid between '(18,0)' and '(18,65535)';
This is the 19th page, as page numbers start at 0. Slot numbers start at 1, but for convenience you are allowed to reference slot 0 as if it were a valid (but empty) slot.
Note that until very recent versions, this will not be fast, as it would scan the whole table and filter out the disqualified rows one by one.
For efficient pagination of query results, you need a sort order. For example
SELECT * FROM t WHERE ...
ORDER BY customer, created_ts;
Make sure that the sort condition is unique. If it isn't, add the primary key at the end.
Then you create an index for the ORDER BY condition:
CREATE INDEX ON t (customer, created_ts);
Then you select the first page of 50 items like this:
SELECT * FROM t WHERE ...
ORDER BY customer, created_ts
LIMIT 50;
You display the page and remember customer and created_ts from the last result row in <latest_cust> and <latest_ts>.
The next page is fetched with
SELECT * FROM t WHERE ...
AND (customer, created_ts) > (<latest_cust>, <latest_ts>)
ORDER BY customer, created_ts
LIMIT 50;
and so on.
This method (it is called “keyset pagination”) is efficient and quite stable in the face of concurrent data modifications.
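If you do have to append the primary key to make the sort key unique, the index and the restart predicate each gain one more column. A sketch, assuming an id primary key on t:
CREATE INDEX ON t (customer, created_ts, id);
SELECT * FROM t WHERE ...
AND (customer, created_ts, id) > (<latest_cust>, <latest_ts>, <latest_id>)
ORDER BY customer, created_ts, id
LIMIT 50;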

Can I obtain a postgresql lock that prevents concurrent inserts for a portion of a table?

I have a join table attendees of rooms and users with an extra column that represents a randomly assigned string (from within a pool of strings).
I want to prevent two users entering at the same time from accidentally getting assigned the same string -- so I'd want User B to wait to look up the previous users in the room (and their assigned strings) until User A has completed inserting -- but I don't want to do a table lock since a row insertion for say room_id = 12 doesn't affect an insertion for room_id = 77.
I can't use a unique constraint to solve this, since duplicate strings are possible in the case of a large number of users in a single room (the strings get reassigned evenly once all of them have been used once).*
My guess is doing something like
SELECT room_id, user_id, random_string FROM attendees WHERE room_id = ? FOR UPDATE
isn't going to help because it's not going to prevent User B from doing an insert -- and even if that SELECT FOR UPDATE prevented user B from doing the same call to read the rows corresponding to that room, what happens if both User A and User B are the first ones to join (and there aren't any rows for that room_id to lock)?
Would I use an advisory lock that's keyed on the room_id? Would that still help if I had multiple concurrent writes (e.g. User A should finish first, then User B, then User C)?
*Here's the example of why the strings aren't necessarily unique: say the pool of strings is "red", "blue", "green" -- the first three users that enter are each assigned one of them randomly, then the pool resets. The next three users are also assigned them randomly, so for the six users in the room, exactly two would have "red", two "blue", and two "green".
After a frustrating few days on SO I ended up getting a helpful response on Reddit that solves the problem:
Use a SELECT FOR UPDATE query on the rooms table, locking the row that corresponds to the room until the first user’s transaction completes.
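In SQL terms that looks roughly like the following (table and column names are assumptions based on the question; the key point is that the lock is taken on the parent rooms row, which always exists, so it works even when the room has no attendees yet):
BEGIN;
-- serialize writers for this room only; inserts for other rooms are unaffected
SELECT id FROM rooms WHERE id = 12 FOR UPDATE;
-- now it is safe to read the strings already assigned in this room
SELECT assigned_string FROM attendees WHERE room_id = 12;
-- pick a free string and insert the new attendee
INSERT INTO attendees (room_id, user_id, assigned_string) VALUES (12, 345, 'red');
COMMIT;
Concurrent transactions for the same room block on the FOR UPDATE until the first one commits, which also answers the ordering concern: User B and then User C simply queue behind User A.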

Postgres count(*) optimization idea

I'm currently working on a project that involves keeping track of users and their actions in my database (PostgreSQL as the RDBMS), and I have run into an issue when trying to perform COUNT(*) on occurrences of each user. What I want is to be able to efficiently count the number of times each user appears across all records, and also to restrict that count to a particular date range.
So the problem is: how do we count the total number of times a user appears in the table's contents, and how do we count that total over a date range?
What I've tried
As you might know, Postgres doesn't support COUNT(*) very well using indices, so we have to consider other ways to reduce the number of records it looks at in order to speed up the query. My first approach is to create a table that keeps track of the number of times a user has a log message associated with them, and on what day (similar to the idea behind a materialized view, but I don't want to continually refresh the materialized view with my count query). Here is what I've come up with:
CREATE TABLE users_counts(user varchar(65536), counter int default 0, day date);
CREATE RULE inc_user_date_count
AS ON INSERT TO main_table
DO ALSO UPDATE users_counts SET counter = counter + 1
WHERE user = NEW.user AND day = DATE(NEW.date_);
What this does is: every time a new record is inserted into my 'main_table', we update the users_counts table, incrementing the row whose day equals the new record's date and whose user matches.
NOTE: the date_ column in 'main_table' is a timestamp so I must cast the new records date_ to be a DATE type.
The problem is: if the user doesn't already have a row in 'users_counts' for the current day, then nothing is updated.
Here is my question:
How do I write the rule so that it checks whether a row exists for the user and the current day, increments that counter if so, and otherwise inserts a new row with the user, the day, and a counter of 1?
I would also like to know whether my approach makes sense, or whether there are ideas I'm missing that I just haven't thought about. As my database grows, counting becomes increasingly inefficient, so I want to avoid any performance bottlenecks.
EDIT 1: I was able to actually figure this out by creating a separate RULE but I'm not sure if this is correct:
CREATE RULE test_insert AS ON INSERT TO main_table
DO ALSO INSERT INTO users_counts(user, counter, day)
SELECT NEW.user, 1, DATE(NEW.date_)
WHERE NOT EXISTS (SELECT 1 FROM users_counts WHERE user = NEW.user AND day = DATE(NEW.date_));
Basically, an insert happens if the user doesn't already exist in my cached table users_counts, and the first rule above updates the count.
What I'm unsure of is how to know which rule is called first, the update rule or the insert rule. And there must be a better way: how do I combine the two rules? Can this be done with a function?
It is true that PostgreSQL is notoriously slow when it comes to count(*) queries. However, if you have a WHERE clause that limits the number of entries, the query will be much faster. If you are using PostgreSQL 9.2 or newer, the query can be just as fast as in MySQL thanks to index-only scans, which were added in 9.2, but it's best to EXPLAIN ANALYZE your query to make sure.
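For example (a hedged sketch using the column names from your question; note that user is a reserved word and needs quoting, and index-only scans also require the table to have been vacuumed recently so the visibility map is populated):
CREATE INDEX ON main_table ("user", date_);
EXPLAIN ANALYZE
SELECT count(*)
FROM main_table
WHERE "user" = 'some_user'
  AND date_ >= '2015-01-01' AND date_ < '2015-02-01';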
Does my solution make sense?
Very much so, provided that your EXPLAIN ANALYZE shows that index-only scans are not being used. Trigger-based solutions like the one you have adapted are widely used. But as you have realized, the problem of the initial state arises (whether to do an update or an insert).
which rule is called first
Multiple rules on the same table and same event type are applied in
alphabetical name order.
from http://www.postgresql.org/docs/9.1/static/sql-createrule.html
The same applies to triggers. If you want a particular rule to be executed first, change its name so that it comes earlier in alphabetical order.
how do I combine the two rules?
One solution is to modify your rule to perform an upsert (look at the bottom of that page for a sample upsert). The other is to populate the counter table with initial values; the trick is to create the trigger at the same time to avoid errors. This blog post explains it really well.
While the initial setup will be slow, each individual insert will probably be faster. The two opposing factors are the slowness of a WHERE NOT EXISTS query vs. the overhead of catching an exception.
Tip: A block containing an EXCEPTION clause is significantly more
expensive to enter and exit than a block without one. Therefore, don't
use EXCEPTION without need.
Source: the PostgreSQL documentation page linked above.
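As a hedged illustration (not from the answer above): on PostgreSQL 9.5 or later, the two rules can be replaced by a single trigger doing INSERT ... ON CONFLICT, provided there is a unique index on the user/day pair. Names are taken from the question; "user" is quoted because it is a reserved word.
CREATE UNIQUE INDEX ON users_counts ("user", day);
CREATE FUNCTION bump_user_count() RETURNS trigger AS $$
BEGIN
  INSERT INTO users_counts ("user", counter, day)
  VALUES (NEW."user", 1, NEW.date_::date)
  ON CONFLICT ("user", day)
  DO UPDATE SET counter = users_counts.counter + 1;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER bump_user_count_trg
AFTER INSERT ON main_table
FOR EACH ROW EXECUTE PROCEDURE bump_user_count();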

PostgreSQL: Returning ordered rows after a specific ID

Scenario:
I am displaying a table of records. It initially displays the first 500 with "show more" at the bottom, which returns the next 500.
Issue:
If between initial display and clicking "show more" 1 record is added, that will cause "order by date, offset 500, limit 500" to overlap by 1 row.
I'd like to "order by date, offset until 'id of last row shown', limit 500"
My row IDs are UUIDs. I am open to alternative approaches that achieve the same result.
If you can order by ID, you can paginate using
where id > $last_seen_id limit 500
but that's not going to be useful where you're sorting by date.
Sort stability!
I really hope that "date" actually means "timestamp" though, otherwise your ordering will be unstable and you can miss rows in pagination; you'll have to order by date, id to get stable ordering if it's really a date, and should probably do so even for timestamp.
State on client
One option is to push the state out to the client. Have the client remember the last-seen (date,id) tuple, and use a row-wise comparison:
where (date, id) > ($last_seen_date, $last_seen_id) order by date, id limit 500
Cursors
Do you care about scalability? If not, you can use a server-side cursor. Declare the cursor for the full query, without the LIMIT. Then FETCH chunks of rows as requested. To do this your app must have a way to consistently bind a connection to a specific user's requests, though, and not to reset that connection or return it to the pool between requests. This might not be practical with your pool/framework, but is probably the best solution if you can do it.
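A rough sketch of how that looks (illustrative names only; the cursor survives only as long as the enclosing transaction on that one connection):
BEGIN;
DECLARE page_cur CURSOR FOR
    SELECT * FROM items ORDER BY created_at, id;
FETCH 500 FROM page_cur;   -- first page
FETCH 500 FROM page_cur;   -- next page, served from the same open transaction on a later request
CLOSE page_cur;
COMMIT;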
Temp tables
Another even less scalable option is to CREATE TABLE sessiondata.myuser_myrequest_blah AS SELECT .... then paginate that table. It's guaranteed not to change. This avoids the difficulty of needing to keep a consistent connection across requests, but will have a very slow first-request response time and is completely impractical for large user counts or large amounts of data.
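Sketched out, again with placeholder names and assuming the sessiondata schema exists:
CREATE TABLE sessiondata.myuser_myrequest_blah AS
    SELECT * FROM items ORDER BY created_at, id;
-- later requests page over the frozen copy; OFFSET is safe here because the snapshot never changes
SELECT * FROM sessiondata.myuser_myrequest_blah
ORDER BY created_at, id
OFFSET 500 LIMIT 500;
DROP TABLE sessiondata.myuser_myrequest_blah;   -- clean up when the session ends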
Related questions
Handling paging with changing sort orders
Using "Cursors" for paging in PostgreSQL
How to provide an API client with 1,000,000 database results?
I think you can use a subquery in the WHERE clause to accomplish this.
E.g., given you're paginating through a users table and you want the records after a given user:
SELECT *
FROM users
WHERE created_at > (
    SELECT created_at
    FROM users
    WHERE users.id = '00000000-1111-2222-3333-444444444444'
    LIMIT 1
)
ORDER BY created_at DESC LIMIT 5;

Pagination logic in Mainframe CICS

Here is my requirement.
The front end (Client) will do a search based on predefined conditions (for instance: customer id, account number, first name, last name, etc.). I need to get the data corresponding to this request from a DB2 database and send it back from the Server. We use CICS channels and containers to pass requests and responses between the Client and Server.
The front end needs the data ordered by: receive date descending, customer id ascending, account number ascending. Data are fetched in pages of 500 records. For example, if a search request from the front end would retrieve 50,000 records from the DB2 database, we need to return this data in 500-record "pages". For pagination, we use the security deposit number field, which is the primary key of our table, but the sorting order is not based on this field.
I would like to know whether we can use scrollable cursor logic in CICS to implement pagination.
Please note that I would prefer not to do an internal array bubble sort to order the response data, as it would degrade performance. I'd like to do it via query logic. Any thoughts?
Example (Initial Front end input request):
Customer id : A
First time request (To identify whether it is first time or next or previous request for pagination)
First security deposit number : 0
Last security deposit number : 0
Since this is the first request, both of these fields will be zero, and we need to retrieve records from the database based on the condition security deposit > 0.
Db2 database:
There are 700 records matching these criteria
Mainframe response for first time:
We will send the first 500 records
The front end will then send a request for the next set of records, which will contain:
Customer id: A
Next request
First security deposit number: 0
Last security deposit number: 17980
So for this request, if I query my database based on security deposit number > 17980, it may list duplicate records on the screen once again, since our sorting order in the database is not based on the security deposit number.
How to implement this logic?
Many Client/Server applications in an IBM Mainframe environment involve pseudo-conversational CICS transactions.
If you are using CICS in pseudo-conversational mode, it is not possible for the Server to hold cursors when it RETURNs to the Client. Therefore scrollable cursors
are of little use in this environment. So to answer your basic question: no, scrollable cursors cannot be used here.
The "trick" here is to create an SQL predicate in the Server that is restartable. It will then pick up rows in the correct order from any given
starting point. When the Client calls your Server it must pass all of the positioning information to your Server.
Typically, on a first call from a Client all of the positioning values are set to cause the cursor to
position itself starting with what must be the first row. The Server then pulls in a "page" worth of data
and returns it to the Client. On the next page forward request the Client sets these positioning values to
the last row it displayed and calls the Server for the next "page" of data.
In your situation I would assume that the page forward cursor would look something like this, all the
variables prefixed with RESTART... are what the Client must provide to the Server to start the cursor
in the correct position.
DECLARE Page-forward CURSOR FOR
SELECT Receive_Date, Customer_id, Account_Nbr, Security_Dep_Id
FROM Table_Name
WHERE ( (Receive_Date < :RESTART-RCV-DT)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id > :RESTART-CUSTOMER-ID)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr > :RESTART-ACCT-NBR)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr = :RESTART-ACCT-NBR AND
Security_Dep_Id > :RESTART-SEC-DEP-ID))
ORDER BY 1 DESC, 2 ASC , 3 ASC, 4 ASC
For the initial call the Client would have passed something like '9999-12-31' as the RESTART-RCV-DT, zero
for the RESTART-CUSTOMER-ID, RESTART-ACCT-NBR and SEC-DEP-ID (assuming these are all numeric). If you look at
the cursor predicate carefully you can verify that there cannot be any rows prior to these values - therefore this
will return the first page of data. If the Client needs to page forward after this, it must tell the Server to start
with the next row after the last one it received. To do this it would populate the RESTART... variables with
the values from the last row on the page it just
displayed. This process will drive the cursor selects forward one page at a time.
When paging up, the process is reversed (you will need a second cursor to support this, and the Client needs to tell you which direction to page: Forward or Back). The Client
will need to populate the RESTART variables with the first row it received from the Server. The trick
for the Server on a page up request is to return the data
to the Client in reverse order. You may have to populate the data page passed back
to the Client in reverse order (ie. put the first row retrieved into the last row of the paging area shared between
the Client and the Server). The page backward cursor would look something like:
DECLARE Page-backward CURSOR FOR
SELECT Receive_Date, Customer_id, Account_Nbr, Security_Dep_Id
FROM Table_Name
WHERE ( (Receive_Date > :RESTART-RCV-DT)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id < :RESTART-CUSTOMER-ID)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr < :RESTART-ACCT-NBR)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr = :RESTART-ACCT-NBR AND
Security_Dep_Id < :RESTART-SEC-DEP-ID))
ORDER BY 1 ASC, 2 DESC , 3 DESC, 4 DESC
As has been pointed out in other answers, this type of paging process does not manage or detect concurrent
updates to the database that may occur during paging transactions. That is another topic for another day...
Developing Restartable Cursors
The key to building a paging Server is to develop a cursor that is restartable from a set of values received
from a Client transaction. This leaves control of cursor positioning and direction with the Client.
It also means the Client must receive all critical positioning data from the Server even though
the Client might not actually
use these data for any other purpose (e.g. From your question I got the impression that the Client may not require
the Security Deposit Id except to supply it as a positioning parameter for your Server).
To build a paging Server you need to know
what the required sorting order of the data are (e.g. Receive Date Descending then Customer Id Ascending then
Account Number Ascending).
You also need to know the set of data that uniquely identifies a row
returned by the cursor. In your case that would be the Security Deposit Id (this is the primary key for the
table you are selecting from, so it must be unique for each and every row in that table). Knowing this, you then build a
cursor predicate (the stuff in the WHERE clause) that will return data needed by the Client in the required sort order and that
also includes the full positioning key (i.e. Security Deposit Id). The fact that two or more returned rows could contain identical data if
the final positioning key were eliminated makes it important that the positioning key be included as a sort condition.
It doesn't matter whether it is ascending or descending, but it needs to be included in the sort to ensure a consistent
order of data retrieval.
A fairly simple formula may be followed to build the predicate for a restartable cursor needed to
support paging Servers. Basically this is a cascade of "OR" clauses connecting a series of "AND" clauses
that become progressively more selective following the sort order required by the Client and end up with the positioning
key.
To see how this works consider how the query for your Server might be developed...
Start with the column from the sort order that changes least often...
SELECT ...
FROM ...
WHERE Receive_Date < restart value
This will retrieve all rows prior to the specified restart Receive date regardless of what the other
column restart values are (e.g. Customer IDs can range from minimum to maximum values, as long as the Receive Date
is less than any Receive Date "seen" so far). Since this column only changes value after all subordinate sort column values
have been exhausted, you can be sure that this does not pick up any rows prior to the full restart key.
But what about those rows that occur on the same date as the restart request but have a
larger Customer Id? These can be picked up with....
SELECT ...
FROM ...
WHERE Receive_Date = restart value AND
Customer_id > restart value
What about those where the Receive Date and Customer Id are the same as the restart key but have
a larger Account Number? These can be picked up with...
SELECT ...
FROM ...
WHERE Receive_Date = restart value AND
Customer_Id = restart value AND
Account_Nbr > restart value
Continue this pattern until the full restart key has been processed. Notice that the inequality
signs are determined by the sort order. Use < when the column is sorted Descending and > when Ascending.
Also notice that the SELECT and FROM clauses
are exactly the same for each query - which means you can put them all together using OR conjunctions...
SELECT Receive_Date, Customer_id, Account_Nbr, Security_Dep_Id
FROM Table_Name
WHERE ( (Receive_Date < :RESTART-RCV-DT)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id > :RESTART-CUSTOMER-ID)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr > :RESTART-ACCT-NBR)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr = :RESTART-ACCT-NBR AND
Security_Dep_Id > :RESTART-SEC-DEP-ID))
ORDER BY 1 DESC, 2 ASC , 3 ASC, 4 ASC
There you go... a restartable cursor for forward paging. Construction of the cursor for backward paging follows a similar pattern, just flip the
sort orders and repeat.
A simplistic approach: Write your SQL to retrieve data according to your criteria, in the sort order you specify. Then only retrieve the keys to the rows you want. Save the keys somewhere you will have access to upon subsequent invocations of your transaction. Look into multi-row select in DB2. Also understand pseudo-conversational programming techniques in CICS.
And now we get to the design implications Bill Woodger mentions, which you do not specify in your question, and which are the reason I'm just hitting the high points of a simplistic approach.
If changes to your result set occur between one invocation and the next, your results will not reflect those changes. You must decide if this is important.
You mention a "front end" but do not specify what it is. If it is a BMS application, you may be able to save the keys in your commarea or in a container. If your front end is a distributed application invoking your transactions via CICS Web Services or CICS Web Support or MQ or raw sockets or whatever, you must design a mechanism to store those keys such that you can uniquely retrieve them — perhaps by sending a contrived key back to the distributed application which it must supply upon subsequent invocations. Then you must have some process to clean up your key store.
Creating a solution to your problem that is unique in your IT shop is not something to be done in isolation. You must involve others who will be tasked with maintaining your application, there may be a group external to your project tasked with making such decisions, there may be infrastructure issues with your solution.
So this isn't so much as an answer to your question as it is an elaboration upon why you may not get an answer, or at least the answer you seem to desire.