Understanding MON$STAT_ID in the Firebird monitoring tables - firebird

I posted a few weeks back inquiring about the firebird DB and how to monitor it. Since then I have come up with a nifty script that monitors all of the page reads/writes/fetches/marks. One of the columns I am monitoring is the MON$STAT_ID and the MON$STAT_GROUP fields. This prints out a nice number for me; however, I have no way to correlate and understand what exactly it is. I thought printing out the MON$STAT_GROUP would help but it has yet to assist me in any way...
I have also looked into the RDB$ commands but have found very limited documentation to see if they might assist me in monitoring my database.
So I decided to come here and inquire first off whether I am monitoring my database in a way that others can view the data from page reads/writes/fetches/marks and make an intelligent decision on whether or not the database is performing as expected.
Secondly, would adding RDB$ commands to my script add anything to the value of the data that I will be giving our database folks?
Lastly, and maybe most importantly, is there anyway to correlate the MON$STAT_ID fields to an actual table in the database to understand when something is going on that should not be? I currently am monitoring the database every minute which may be to frequent, but I am getting valid data out. The only question now is how to interpret this data. Can someone give me advice on methods they use/have used in the past that have worked for them?
(NOTE: Running firebird 2.1)

The column MON$STAT_ID in MON$IO_STATS (and MON$RECORD_STATS and MON$MEMORY_USAGE) is the primary key of the record in the monitoring table. Almost all other monitoring tables include a MON$STAT_ID to point to these statistics: MON$ATTACHMENTS, MON$CALL_STACK, MON$DATABASE, MON$STATEMENTS, MON$TRANSACTIONS.
In other words: the statistics apply on the database, attachment, transaction, statement or call level (PSQL executes). The statistics tables contain a column called MON$STAT_GROUP to discern these types. The values of MON$STAT_GROUP are described in RDB$TYPES:
0 : DATABASE
1 : ATTACHMENT
2 : TRANSACTION
3 : STATEMENT
4 : CALL
Typically the statistics of level 0 contain all from level 1, level 1 contains all from level 2 for that attachment, level 2 contains all from level 3 for that transaction, level 3 contains all from level 4 for that statement.
As there might be data processed unrelated to the lower level, or a specific attachment, transaction or statement handle has already been dropped, the numbers of the lower level do not necessarily aggregate to the entire number of the higher level.
There is no way to correlate the statistics to a specific table (as this information isn't table related, but - simplified - from executing statements which might cover multiple tables).
As I also commented, I am unsure what you mean with "RDB$ commands". But I am assuming you are talking about RDB$GET_CONTEXT() and RDB$SET_CONTEXT(). You could use RDB$GET_CONTEXT() to obtain the current connection (SESSION_ID) and transaction id (TRANSACTION_ID). These values values can be used for MON$ATTACHMENT_ID and MON$TRANSACTION_ID in the monitoring tables. I don't think the other variables in the SYSTEM namespace are interesting, and those in USER_SESSION and USER_TRANSACTION are all user-defined (and initially those namespaces are empty).
It is far easier to use the CURRENT_CONNECTION and CURRENT_TRANSACTION context variables within a statement. As documented in doc\README.monitoring_tables.txt in the Firebird installation:
System variables CURRENT_CONNECTION and CURRENT_TRANSACTION could be used to select data about the current (for the caller) connection and transaction respectively. These variables correspond to the ID columns of the appropriate monitoring tables.
Note: my answer is based on Firebird 2.5.

To present statistics by specific tables I use this SQL (FB 3)
select t.mon$table_name,trim(
case when r.mon$record_seq_reads>0 then 'Non index Reads: '||r.mon$record_seq_reads else '' end||
case when r.mon$record_idx_reads>0 then ' Index Reads: '||r.mon$record_idx_reads else '' end||
case when r.mon$record_inserts>0 then ' Inserts: '||r.mon$record_inserts else '' end||
case when r.mon$record_updates>0 then ' Updates: '||r.mon$record_updates else '' end||
case when r.mon$record_deletes>0 then ' Deletes: '||r.mon$record_deletes else '' end)
from MON$TABLE_STATS t
join mon$record_stats r on r.mon$stat_id=t.mon$record_stat_id
where t.mon$table_name not starting 'RDB$' and r.mon$stat_group=2
order by 1

Related

Statistics of all/many tables in FileMaker

I'm writing a kind of summary page for my FileMaker solution.
For this, I have define a "statistics" table, which uses formula fields with ExecuteSQL to gather info from most tables, such as number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds when I have a total of about 20k records in about 10 tables. The same SQL on any database system shouldn't take more than some fractions of a second.
What could the reason be, what can I do about it and where can I start debugging to figure out what's causing all this time?
The actual code is, like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema. That is "usually", as you have found.
Can you instead use the DDR ( database design report ) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception also will give you many of the stats you are looking for. Again, depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistic table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple things that will affect how ExecuteSQL performs.
One thing to keep in mind, whether ExecuteSQL, a Perform Find, or relationship, it's all the same basic query under-the-hood. So if it would be slow doing it one way, it's going to likely be slow with any other directly related approach.
Taking these one at a time:
All records count.
Placing an unstored calc in the target table allows you to get the count of the records through the relationship, without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. Super light way to get that info vs using Count which requires FileMaker to touch every record on the other side.
Sum of Records Matching a Value.
using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then having a Summary field in the target table so sum the records may prove to be more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. ( just don't show that field on any of your layouts if you don't need it )
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet-spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
Also note, if you have an open record ( that means you as the current user ), FileMaker Server doesn't know what data you have on the client side, and so it sends ALL of the records. That's why I asked if you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.

What could be causing this 'invalid host' error on kdb query?

I get an odd error when trying to query too many dates from a date-partitioned historical database:
q)eod: h"select from eod where date within 2018.01.01 2018.04.22"
'/tablepath/2018.04.04/eod/somecolumn: invalid host
q)eod: h"select from eod where date within 2018.01.17 2018.04.20"
'/tablepath/2018.04.20/eod/othercolumn: invalid host
q)eod: h"select from eod where date within 2018.01.18 2018.04.20"
q)
Note that both dates mentioned in the error messages are within the date range that we manage to extract in the end, and that it fails on a different column each time. This seems to indicate it's something to do with the size of the table being pulled, but when we check the size of the largest table we managed to get:
q)(-22!eod) % 1024 * 1024
646.9043
q)count eod
2872546
we find that it's not particularly large by either memory size nor by number of rows.
Googling for "invalid host" errors doesn't seem to turn up anything relevant, and I'm not seeing anything in the kdb docs about size limits that would be relevant. Anyone got any ideas?
Edit:
When loading the table in a session and making the queries directly, we get what appears to be the same error, but with a different message. For instance:
q)jj: select from eod where date within 2018.01.01 2018.04.22
Too many compressed files open
k){0!(?).#[x;0;p1[;y;z]]}
'./2018.04.04/eod/settlecab: No such file or directory
.
?
(+`exch`date`class..
q.Q))
Note that the file ./2018.04.04/eod/settlecab does in fact exist, and contains data:
I have no problem loading the data for just the date mentioned in the error, and the column mentioned has meaningful values:
q)jj: select from eod where date=2018.04.04
q)select count i by settlecab from jj
settlecab| x
---------| -----
0 | 41573
1 | 2269
The key point seems to be the Too many compressed files open message, but what can I do about this?
Edit for Summary/Solutions:
The table in question had many columns, all stored in a compressed format. When issuing a query against too many dates at once, kdb would try to mmap all of those columns at once, running into a limit on how many compressed files could be open at once.
Once I understood the problem, several solutions were available:
I could pull only certain columns from the database, reducing the number of files that kdb needed to keep open,
I could force kdb to pull all the data into memory by adding a dummy where clause to the query, such as (null column) | not null column (hacky, but it works),
I could have upgraded the kdb version and lifted OS limits (not practical in my case).
I still have no idea why this resulted in an invalid host error when querying the database remotely.
First off, can we just clarify the database structure you're working with. It seems from the filepaths returned in your errors that you've got a date-partitioned database. Did you mean non-segmented database when you said non-partitioned in your original query?
In terms of a fix for your issue, have you tried loading your database into a session, and making those queries directly? If so do you get the same issues?
If that seems to be working alright, the problem might lie with how you're defining your database handle. How is h defined in your original example?
It might also be worth trying to select individual dates from your database, to try and isolate the problem, and to determine if it lies with your on-disk data. Try specifically querying the dates that are mentioned in your errors.
You could also try performing your original queries with a subset a columns, again to try and pinpoint where your issue is coming from.
Let us know if you get any further with this.
Joseph

Postgres 'if not exists' fails because the sequence exists

I have several counters in an application I am building, as am trying to get them to be dynamically created by the application as required.
For a simplistic example, if someone types a word into a script it should return the number of times that word has been entered previously. Here is an example of sql that may be executed if they typed the word example.
CREATE SEQUENCE IF NOT EXISTS example START WITH 1;
SELECT nextval('example')
This would return 1 the first time it ran, 2 the second time, etc.
The problem is when 2 people click the button at the same time.
First, please note that a lot more is happening in my application than just these statements, so the chances of them overlapping is much more significant than it would be if this was all that was happening.
1> BEGIN;
2> BEGIN;
1> CREATE SEQUENCE IF NOT EXISTS example START WITH 1;
2> CREATE SEQUENCE IF NOT EXISTS example START WITH 1; -- is blocked by previous statement
1> SELECT nextval('example') -- returns 1 to user.
1> COMMIT; -- unblocks second connection
2> ERROR: duplicate key value violates unique constraint
"pg_type_typname_nsp_index"
DETAIL: Key (typname, typnamespace)=(example, 109649) already exists.
I was under the impression that by using "IF NOT EXISTS", the statement should just be a no-op if it does exist, but it seems to have this race condition where that is not the case. I say race condition because if these two are not executed at the same time, it works as one would expect.
I have noticed that IF NOT EXISTS is fairly new to postgres, so maybe they haven't worked out all of the kinks yet?
EDIT:
The main reason we were considering doing things this way was to avoid excess locking. The thought being that if two people were to increment at the same time, using a sequence would mean that neither user should have to wait for the other (except, as in this example, for the initial creation of that sequence)
Sequences are part of the database schema. If you find yourself modifying the schema dynamically based on the data stored in the database, you are probably doing something wrong. This is especially true for sequences, which have special properties e.g. regarding their behavior with respect to transactions. Specifically, if you increment a sequence (with the help of nextval) in the middle of a transaction and then you rollback that transaction, the value of the sequence will not be rolled back. So most likely, this kind of behavior is something that you don't want with your data. In your example, imagine that a user tries to add word. This results in the corresponding sequence being incremented. Now imagine that the transaction does not complete for reason (e.g. maybe the computer crashes) and it gets rolled back. You would end up with the word not being added to the database but with the sequence being incremented.
For the particular example that you mentioned, there is an easy solution; create an ordinary table to store all the "sequences". Something like that would do it:
CREATE TABLE word_frequency (
word text NOT NULL UNIQUE,
frequency integer NOT NULL
);
Now I understand that this is just an example, but if this approach doesn't work for your actual use case, let us know and we can adjust it to your needs.
Edit: Here's how you the above solution works. If a new word is added, run the following query ("UPSERT" syntax in postgres 9.5+ only):
INSERT INTO word_frequency(word,frequency)
VALUES ('foo',1)
ON CONFLICT (word)
DO UPDATE
SET frequency = word_frequency.frequency + excluded.frequency
RETURNING frequency;
This query will insert a new word in word_frequency with frequency 1, or if the word exists already it will increment the existing frequency by 1. Now what happens if two transaction try to do that at the same time? Consider the following scenario:
client 1 client 2
-------- --------
BEGIN
BEGIN
UPSERT ('foo',1)
UPSERT ('foo',1) <====
COMMIT
COMMIT
What will happen is that as soon as client 2 tries increment the frequency for foo (marked with the arrow above), that operation will block because the row was modified by a different transaction. When client 1 commits, client 2 will get unblocked and continue without any errors. This is exactly how we wanted it to work. Also note, that postgresql will use row-level locking to implement this behavior, so other insertions will not be blocked.
EDIT: The main reason we were considering doing things this way was to
avoid excess locking. The thought being that if two people were to
increment at the same time, using a sequence would mean that neither
user should have to wait for the other (except, as in this example,
for the initial creation of that sequence)
It sounds like you're optimizing for a problem that likely does not exist. Sure, if you have 100,000 simultaneous users that are only inserting rows (since a sequence will only be used then normally) there is the possibility of some contention with the sequence but realistically there will be other bottle necks long before the sequence gets in the way.
I'd advise you to first prove that the sequence is an issue. With a proper database design (which dynamic DDL is not) the sequence will not be the bottle neck.
As a reference, DDL is not transaction safe in most databases.

Can I have more than 250 columns in the result of a PostgreSQL query?

Note that PostgreSQL website mentions that it has a limit on number of columns between 250-1600 columns depending on column types.
Scenario:
Say I have data in 17 tables each table having around 100 columns. All are joinable through primary keys. Would it be okay if I select all these columns in a single select statement? The query would be pretty complex but can be programmatically generated. The reason for doing this is to get denormalised data to populate a web page. Please do not ask why though :)
Quite obviously if I do create table table1 as (<the complex select statement>), I will be hitting the limit mentioned in the website. But do simple queries also face the same restriction?
I could probably find this out by doing the exercise myself. In the next few days I probably will. However, if someone has an idea about this and the problems I might face by doing a single query, please share the knowledge.
I can't find definitive documentation to back this up, but I have
received the following error using JDBC on Postgresql 9.1 before.
org.postgresql.util.PSQLException: ERROR: target lists can have at most 1664 entries
As I say though, I can't find the documentation for that so it may
vary by release.
I've found the confirmation. The maximum is 1664.
This is one of the metrics that is available for confirmation in the INFORMATION_SCHEMA.SQL_SIZING table.
SELECT * FROM INFORMATION_SCHEMA.SQL_SIZING
WHERE SIZING_NAME = 'MAXIMUM COLUMNS IN SELECT';

Is it correct to scan a table in MySQL using "SELECT * .. LiMIT start, count" without an ORDER BY clause?

Suppose Table X has a 100 tuples.
Will the following approach to scanning X generate all the tuples in TABLE X, in MySQL?
for start in [0, 10, 20, ..., 90]:
print results of "select * from X LIMIT start, 10;"
I ask, because I've been using PostgreSQL, which clearly says that this approach need not work, but there seems to be no such info for MySQL. If it won't, is there a way to return results in a fixed ordering without knowing any other info about the table (like what the primary key fields are)?
I need to scan each tuple in a table in an application, and I want a way to do it without using too much memory in the application (so simply doing a "select * from X" is out).
No, that isn't a safe assumption. Without an ORDER BY clause, there is no guaranteeing that your query will return unique results each time. If this table is properly indexed, adding an ORDER BY (for the index) shouldn't be too expensive.
Edit: Non-ORDER BYed results will sometimes be in the order of the clustered index, but I wouldn't put any money on that!
If you are using Innodb or MyISAM table types, a better approach is to use the HANDLER interface. Only MySQL supports this, but it does what you want:
http://dev.mysql.com/doc/refman/5.0/en/handler.html
Also, the MySQL API supports two modes of retrieving data from the server:
store result: in this mode, as soon as a query is executed, the API retrieves the entire result set before returning to the user code. This can use up a lot of client memory buffering results, but minimises the use of resources on the server.
use result: in this mode, the API pulls results row-by-row and returns control to the user code more frequently. This minimises the use of memory on the client, but can hold locks on the server for longer.
Most of the MySQL APIs for various languages support this in oneform or another. It is usually an argument that can be supplied as when creating the connection, and / or a separate call that can be used against an existing connection to switch it to that mode.
So, in answer to your question - I would do the following:
set the connection to "use result" mode;
select * from X