What could be causing this 'invalid host' error on kdb query? - kdb

I get an odd error when trying to query too many dates from a date-partitioned historical database:
q)eod: h"select from eod where date within 2018.01.01 2018.04.22"
'/tablepath/2018.04.04/eod/somecolumn: invalid host
q)eod: h"select from eod where date within 2018.01.17 2018.04.20"
'/tablepath/2018.04.20/eod/othercolumn: invalid host
q)eod: h"select from eod where date within 2018.01.18 2018.04.20"
q)
Note that both dates mentioned in the error messages are within the date range that we manage to extract in the end, and that it fails on a different column each time. This seems to indicate it's something to do with the size of the table being pulled, but when we check the size of the largest table we managed to get:
q)(-22!eod) % 1024 * 1024
646.9043
q)count eod
2872546
we find that it's not particularly large by either memory size nor by number of rows.
Googling for "invalid host" errors doesn't seem to turn up anything relevant, and I'm not seeing anything in the kdb docs about size limits that would be relevant. Anyone got any ideas?
Edit:
When loading the table in a session and making the queries directly, we get what appears to be the same error, but with a different message. For instance:
q)jj: select from eod where date within 2018.01.01 2018.04.22
Too many compressed files open
k){0!(?).#[x;0;p1[;y;z]]}
'./2018.04.04/eod/settlecab: No such file or directory
.
?
(+`exch`date`class..
q.Q))
Note that the file ./2018.04.04/eod/settlecab does in fact exist, and contains data:
I have no problem loading the data for just the date mentioned in the error, and the column mentioned has meaningful values:
q)jj: select from eod where date=2018.04.04
q)select count i by settlecab from jj
settlecab| x
---------| -----
0 | 41573
1 | 2269
The key point seems to be the Too many compressed files open message, but what can I do about this?
Edit for Summary/Solutions:
The table in question had many columns, all stored in a compressed format. When issuing a query against too many dates at once, kdb would try to mmap all of those columns at once, running into a limit on how many compressed files could be open at once.
Once I understood the problem, several solutions were available:
I could pull only certain columns from the database, reducing the number of files that kdb needed to keep open,
I could force kdb to pull all the data into memory by adding a dummy where clause to the query, such as (null column) | not null column (hacky, but it works),
I could have upgraded the kdb version and lifted OS limits (not practical in my case).
I still have no idea why this resulted in an invalid host error when querying the database remotely.

First off, can we just clarify the database structure you're working with. It seems from the filepaths returned in your errors that you've got a date-partitioned database. Did you mean non-segmented database when you said non-partitioned in your original query?
In terms of a fix for your issue, have you tried loading your database into a session, and making those queries directly? If so do you get the same issues?
If that seems to be working alright, the problem might lie with how you're defining your database handle. How is h defined in your original example?
It might also be worth trying to select individual dates from your database, to try and isolate the problem, and to determine if it lies with your on-disk data. Try specifically querying the dates that are mentioned in your errors.
You could also try performing your original queries with a subset a columns, again to try and pinpoint where your issue is coming from.
Let us know if you get any further with this.
Joseph

Related

pq: [parent] Data too large

I'm using Grafana to visualize some data stored in CrateDB in different panes.
Some of my boards work correctly, but there are 3 specific boards (created by someone from my work team), in which at certain times of the day they stop showing data (No Data) and as a warning it shows the following error:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
Honestly, I would like to understand what it means, and how I can fix it.
I remain attentive to any help you can give me, and I deeply appreciate the help.
what I tried
17 SQL Statements of the form:
SELECT
time_index AS "time",
entity_id AS metric,
v1_ps
FROM etsm
WHERE
entity_id = 'SM_B3_RECT'
ORDER BY 1,2
for 17 different entities.
what I hope
I hope to receive the data corresponding to each of the SQL statements for their respective graphing.
The result
As a result, there is no data received on some of the statements made and the warning message I shared:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
As an additional fact, the graph is configured to update every 15 min, but no matter how many times you manually update the graph, the statements that receive data are different.
Example: I refresh the panel and the SQL statements A, B and C get data, while the others don't. I refresh the panel and the SQL statements D, H and J receive data, and the others don't (with a random pattern).
Other additional information:
I have access to the database being consulted with Grafana, and the data is there
You don't have time condition, so query select/process all records all the time and you are hitting limits (e. g. size of processed data) of your DB. Add time condition, so only fraction of all records will be returned.

Statistics of all/many tables in FileMaker

I'm writing a kind of summary page for my FileMaker solution.
For this, I have define a "statistics" table, which uses formula fields with ExecuteSQL to gather info from most tables, such as number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds when I have a total of about 20k records in about 10 tables. The same SQL on any database system shouldn't take more than some fractions of a second.
What could the reason be, what can I do about it and where can I start debugging to figure out what's causing all this time?
The actual code is, like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema. That is "usually", as you have found.
Can you instead use the DDR ( database design report ) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception also will give you many of the stats you are looking for. Again, depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistic table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple things that will affect how ExecuteSQL performs.
One thing to keep in mind, whether ExecuteSQL, a Perform Find, or relationship, it's all the same basic query under-the-hood. So if it would be slow doing it one way, it's going to likely be slow with any other directly related approach.
Taking these one at a time:
All records count.
Placing an unstored calc in the target table allows you to get the count of the records through the relationship, without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. Super light way to get that info vs using Count which requires FileMaker to touch every record on the other side.
Sum of Records Matching a Value.
using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then having a Summary field in the target table so sum the records may prove to be more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. ( just don't show that field on any of your layouts if you don't need it )
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet-spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
Also note, if you have an open record ( that means you as the current user ), FileMaker Server doesn't know what data you have on the client side, and so it sends ALL of the records. That's why I asked if you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.

Data Lake Analytics - Large vertex query

I have a simple query which make a GROUP BY using two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows but query takes several minutes to finish. Seeing the graph explorer I see it uses 2500 vertices because I'm having 2500 CodFactura+DateKey unique rows. I don't know if it normal ADAL behaviour. Is there any way to reduce the vertices number and execute this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns and you either have a predicate that reduces the number of rows per partition and/or you do not have uptodate statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into less, larger files.
You can also try to hint a small number of rows in your query that creates #table_facturas if the above does not help by adding OPTION(ROWCOUNT=4000).

Can I have more than 250 columns in the result of a PostgreSQL query?

Note that PostgreSQL website mentions that it has a limit on number of columns between 250-1600 columns depending on column types.
Scenario:
Say I have data in 17 tables each table having around 100 columns. All are joinable through primary keys. Would it be okay if I select all these columns in a single select statement? The query would be pretty complex but can be programmatically generated. The reason for doing this is to get denormalised data to populate a web page. Please do not ask why though :)
Quite obviously if I do create table table1 as (<the complex select statement>), I will be hitting the limit mentioned in the website. But do simple queries also face the same restriction?
I could probably find this out by doing the exercise myself. In the next few days I probably will. However, if someone has an idea about this and the problems I might face by doing a single query, please share the knowledge.
I can't find definitive documentation to back this up, but I have
received the following error using JDBC on Postgresql 9.1 before.
org.postgresql.util.PSQLException: ERROR: target lists can have at most 1664 entries
As I say though, I can't find the documentation for that so it may
vary by release.
I've found the confirmation. The maximum is 1664.
This is one of the metrics that is available for confirmation in the INFORMATION_SCHEMA.SQL_SIZING table.
SELECT * FROM INFORMATION_SCHEMA.SQL_SIZING
WHERE SIZING_NAME = 'MAXIMUM COLUMNS IN SELECT';

Understanding MON$STAT_ID in the Firebird monitoring tables

I posted a few weeks back inquiring about the firebird DB and how to monitor it. Since then I have come up with a nifty script that monitors all of the page reads/writes/fetches/marks. One of the columns I am monitoring is the MON$STAT_ID and the MON$STAT_GROUP fields. This prints out a nice number for me; however, I have no way to correlate and understand what exactly it is. I thought printing out the MON$STAT_GROUP would help but it has yet to assist me in any way...
I have also looked into the RDB$ commands but have found very limited documentation to see if they might assist me in monitoring my database.
So I decided to come here and inquire first off whether I am monitoring my database in a way that others can view the data from page reads/writes/fetches/marks and make an intelligent decision on whether or not the database is performing as expected.
Secondly, would adding RDB$ commands to my script add anything to the value of the data that I will be giving our database folks?
Lastly, and maybe most importantly, is there anyway to correlate the MON$STAT_ID fields to an actual table in the database to understand when something is going on that should not be? I currently am monitoring the database every minute which may be to frequent, but I am getting valid data out. The only question now is how to interpret this data. Can someone give me advice on methods they use/have used in the past that have worked for them?
(NOTE: Running firebird 2.1)
The column MON$STAT_ID in MON$IO_STATS (and MON$RECORD_STATS and MON$MEMORY_USAGE) is the primary key of the record in the monitoring table. Almost all other monitoring tables include a MON$STAT_ID to point to these statistics: MON$ATTACHMENTS, MON$CALL_STACK, MON$DATABASE, MON$STATEMENTS, MON$TRANSACTIONS.
In other words: the statistics apply on the database, attachment, transaction, statement or call level (PSQL executes). The statistics tables contain a column called MON$STAT_GROUP to discern these types. The values of MON$STAT_GROUP are described in RDB$TYPES:
0 : DATABASE
1 : ATTACHMENT
2 : TRANSACTION
3 : STATEMENT
4 : CALL
Typically the statistics of level 0 contain all from level 1, level 1 contains all from level 2 for that attachment, level 2 contains all from level 3 for that transaction, level 3 contains all from level 4 for that statement.
As there might be data processed unrelated to the lower level, or a specific attachment, transaction or statement handle has already been dropped, the numbers of the lower level do not necessarily aggregate to the entire number of the higher level.
There is no way to correlate the statistics to a specific table (as this information isn't table related, but - simplified - from executing statements which might cover multiple tables).
As I also commented, I am unsure what you mean with "RDB$ commands". But I am assuming you are talking about RDB$GET_CONTEXT() and RDB$SET_CONTEXT(). You could use RDB$GET_CONTEXT() to obtain the current connection (SESSION_ID) and transaction id (TRANSACTION_ID). These values values can be used for MON$ATTACHMENT_ID and MON$TRANSACTION_ID in the monitoring tables. I don't think the other variables in the SYSTEM namespace are interesting, and those in USER_SESSION and USER_TRANSACTION are all user-defined (and initially those namespaces are empty).
It is far easier to use the CURRENT_CONNECTION and CURRENT_TRANSACTION context variables within a statement. As documented in doc\README.monitoring_tables.txt in the Firebird installation:
System variables CURRENT_CONNECTION and CURRENT_TRANSACTION could be used to select data about the current (for the caller) connection and transaction respectively. These variables correspond to the ID columns of the appropriate monitoring tables.
Note: my answer is based on Firebird 2.5.
To present statistics by specific tables I use this SQL (FB 3)
select t.mon$table_name,trim(
case when r.mon$record_seq_reads>0 then 'Non index Reads: '||r.mon$record_seq_reads else '' end||
case when r.mon$record_idx_reads>0 then ' Index Reads: '||r.mon$record_idx_reads else '' end||
case when r.mon$record_inserts>0 then ' Inserts: '||r.mon$record_inserts else '' end||
case when r.mon$record_updates>0 then ' Updates: '||r.mon$record_updates else '' end||
case when r.mon$record_deletes>0 then ' Deletes: '||r.mon$record_deletes else '' end)
from MON$TABLE_STATS t
join mon$record_stats r on r.mon$stat_id=t.mon$record_stat_id
where t.mon$table_name not starting 'RDB$' and r.mon$stat_group=2
order by 1