In my queries I often use Enumerable.Contains(..), and in complex queries the actual query compilation takes forever (10-20 s), while the query execution itself takes less than a second.
Does anyone have any advice (or a workaround) for this problem? I need to use Enumerable.Contains, since I have a table with a hierarchical structure (n levels deep). Take the well-known example:
CPU
-- AMD
---- FX
---- A8
-- Intel
---- I3
---- I5
etc. If I need to retrieve data for all CPUs, I first get the IDs of all products that belong to this group (and its subgroups), and then use
ProductIds.Contains(p.Id)
in a query. This also prevents caching, so the query is always recompiled. If anyone knows a better way, my ears are open.
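For context, LINQ providers such as Entity Framework translate Enumerable.Contains into a SQL IN list with one entry per element, so the statement sent to the server ends up looking roughly like the sketch below (a Products table is assumed; the extra columns and ID values are made up):
SELECT p.Id, p.Name, p.GroupId
FROM Products AS p
WHERE p.Id IN (101, 102, 103, 104)  -- one entry per collected product ID
Every distinct set of IDs produces a structurally different statement, so neither the compiled LINQ query nor the server-side plan gets reused.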
Regards,
Goran
Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translate column names to be consistent, and then to use aj and aj0 to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and the record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource-hungry. The alternative to the intermediate tables is to have a single, very complicated-looking query. Would that actually help? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per the above) ajs: rates to market data, and market data to trades.
If that's not what you're looking for, I could give some more specifics, although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the easier it is for you to debug later, and for any future readers/users of your code.
Unless absolutely necessary, I would try to avoid creating 7 copies of the same table. If you are dealing with large tables, memory could quickly become a concern, particularly if the processing takes a long time, since you could be creating large memory spikes. I try to stick to updating 1-2 variables at the different stages, e.g.:
res: select from trades;
/ join each trade to the USD rate for its ccy prevailing at arrivalTime (bin, i.e. at-or-before; see the binr note below for "next after")
res:aj[`ccy`arrivalTime;
  res;
  select ccy:currency, arrivalTime:dateTime, fxRate from usdRates];
/ apply the conversion; someFunc stands in for the actual calculation
res:update someFunc fxRate from res;
Sean beat me to it, but an aj for the next time after (a reverse aj) is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediate tables unless you are calculating cross rates? In that case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better (e.g. sym vs symbol).
I am trying to compare the performance of a view before and after adding an index, so I am trying to measure its performance using the query below:
create table qtemp.ffs as (select * from psavlldsvw) with data
Statement ran successfully (1,932 ms = 1.932 sec)
The statement above is what I have used, where psavlldsvw is the view name.
As you might guess, the idea is to measure how much time the above query takes to complete in both cases.
Can I please get some feedback on how good this method is for comparison?
The test is indeed meaningless...
First of all, the question is poorly worded: you cannot, and are not, testing a view. Views are performance-neutral on Db2 for i.
Running a statement, adding an index and rerunning the statement is a meaningless test. Db2 for i has all kinds of tricks built in to improve the speed of a repeated statement. Among them:
Input data cached in memory
Data access paths are left open
Starting from a fresh connection, you can ensure that no data is in memory by using SETOBJACC OBJ(YOURLIB/YOURFILE) OBJTYPE(*FILE) POOL(*PURGE) for each table referenced by your statement.
Now run the statement multiple times; at least 3 if the system defaults have not been changed. You should see that the first (few) iterations are slower than the last few. This is a result of the data access path being left open for a repeated statement.
Now add your index, disconnect/reconnect, clear the object(s) from memory and run your tests again.
Depending on the use case for the statement, you may want to focus on the first iteration performance or the later iterations.
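Putting those steps together, a rough sketch of the test sequence might look like the following (library, table and column names are placeholders; SETOBJACC is a CL command, shown here only as comments):
-- from a fresh connection, purge cached data for every table behind the view
-- CL, not SQL: SETOBJACC OBJ(MYLIB/MYTABLE) OBJTYPE(*FILE) POOL(*PURGE)
-- baseline: run the statement at least three times, noting each timing
create table qtemp.ffs1 as (select * from psavlldsvw) with data;
create table qtemp.ffs2 as (select * from psavlldsvw) with data;
create table qtemp.ffs3 as (select * from psavlldsvw) with data;
-- add the candidate index (placeholder names), then disconnect/reconnect,
-- purge with SETOBJACC again, and repeat the three runs
create index mylib.mytable_ix1 on mylib.mytable (some_column);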
Mao is correct in that using Visual Explain (VE) is the best way to see if an index is being used or otherwise understanding how the query is performing.
Lastly, realize that load on the server affects how the query engine operates. The optimizer calculates your job's "fair share" of memory, and that value affects whether or not some more efficient yet memory-intensive plans are used. So if you're testing in a non-prod environment that doesn't exactly match prod in terms of resources, data size and load, the results are likely to differ when the query is moved to prod.
Performance tuning is part art, part science. Generally, use VE to ensure that you've got a decent query to start with, then monitor actual production use to ensure that it's performing as expected.
I'm trying to understand the EXPLAIN function. I have two queries: the first query is optimised and runs in about 600 ms (I have 100k rows), while the second query runs in about 900 ms.
But when I run EXPLAIN ANALYZE, the first query, which runs quickly, shows a cost of 64296, while the second query, which runs slowly, shows a cost of 20873.
I can't understand why the faster query has the bigger cost, and why the slower query has the smaller cost.
Could someone give me a hint?
PostgreSQL EXPLAIN is an animal that really has a lot of arms & legs, each of which can cause it to work in a way that isn't easy to understand at first.
To answer your question: I understand that when you actually run the first query, Q1 (not its EXPLAIN), it is faster than the second (Q2), but when you do an EXPLAIN ANALYSE, Q1 shows the higher cost.
Two reasons come to mind at the moment:
If the queries are LIMIT queries, it's possible for Q1 to execute faster and still have a higher 'cost', since the PostgreSQL planner (intentionally) does not plan for the smallest total cost, but for the smallest cost of the required result (in this case, a smaller number of rows); see the sketch below.
Another reason could be that caching is playing havoc with your timings. Could you confirm whether the observation persists across multiple (3+) runs?
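To illustrate the LIMIT point with a small, hypothetical example (the table, column and index are made up):
EXPLAIN ANALYZE
SELECT * FROM orders            -- assume an index on created_at exists
ORDER BY created_at
LIMIT 10;
-- the Index Scan child node reports the cost of reading the whole table in order,
-- while the Limit node on top reports only the small fraction the planner expects
-- to actually pay, so comparing raw "total cost" figures across queries can mislead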
Besides these hunches, if you really want to get deep into understanding EXPLAIN, I recommend the articles here, here and here.
The cost is the planner's estimate of how many resources (I/O and CPU time) it will take to perform the query. It's just an estimate, calculated by a mathematical model.
In your case the planner was wrong; it chose a suboptimal plan. That happens sometimes.
Why? There could be many reasons. Maybe the statistics are inadequate (try running ANALYZE on your tables first of all). Maybe the statistics are fine, but the planner uses the wrong model (for example, you may have correlated predicates in your query, which are known to be problematic). Maybe your query spans several dozen tables and the planner just can't go through all possible plans. And so on.
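As a quick first check, refreshing statistics and re-reading the plan is cheap (the table and column names here are placeholders):
ANALYZE my_table;   -- refresh the planner's statistics for this table
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_table WHERE some_col = 42;   -- compare estimated cost vs. actual time again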
I have many read-only tables in a Postgres database. All of these tables can be queried using any combination of columns.
What can I do to optimize queries? Is it a good idea to add indexes to all columns to all tables?
Columns that are used for filtering or joining (or, to a lesser degree, sorting) are of interest for indexing. Columns that are just selected are barely relevant!
For the following query only indexes on a and e may be useful:
SELECT a,b,c,d
FROM tbl_a
WHERE a = $some_value
AND e < $other_value;
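As a sketch, the corresponding indexes could be plain B-tree indexes (the default), or a single multicolumn index if this exact combination is frequent:
CREATE INDEX tbl_a_a_idx ON tbl_a (a);
CREATE INDEX tbl_a_e_idx ON tbl_a (e);
-- or, equality column first, range column second:
CREATE INDEX tbl_a_a_e_idx ON tbl_a (a, e);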
Here, f and possibly c are candidates, too:
SELECT a,b,c,d
FROM tbl_a
JOIN tbl_b USING (f)
WHERE a = $some_value
AND e < $other_value
ORDER BY c;
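A matching sketch for this query, assuming f is the join column on tbl_b and c lives in tbl_a:
CREATE INDEX tbl_b_f_idx ON tbl_b (f);  -- join column on the other table
CREATE INDEX tbl_a_c_idx ON tbl_a (c);  -- may let the planner satisfy ORDER BY c without a sort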
After creating indexes, test to see if they are actually useful with EXPLAIN ANALYZE. Also compare execution times with and without the indexes. Deleting and recreating indexes is fast and easy. There are also parameters to experiment with EXPLAIN ANALYZE. The difference may be staggering or nonexistent.
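One common way to experiment is to toggle the planner's enable_* settings for the current session and compare plans and timings; a sketch, with literals standing in for $some_value and $other_value:
EXPLAIN ANALYZE SELECT a,b,c,d FROM tbl_a WHERE a = 123 AND e < 456;   -- with the new indexes available
SET enable_indexscan = off;    -- temporarily discourage index scans in this session
SET enable_bitmapscan = off;
EXPLAIN ANALYZE SELECT a,b,c,d FROM tbl_a WHERE a = 123 AND e < 456;   -- pushes the planner toward a seq scan
RESET enable_indexscan;
RESET enable_bitmapscan;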
As your tables are read only, index maintenance is cheap. It's merely a question of disk space.
If you really want to know what you are doing, start by reading the docs.
If you don't know what queries to expect ...
Try logging enough queries to find typical use cases. Log every query with the parameter log_statement = all, or just log slow queries using log_min_duration_statement; a short sketch follows after these steps.
Create indexes that might be useful and check the statistics after some time to see what actually gets used. PostgreSQL has a whole infrastructure in place for monitoring statistics. One convenient way to study them (and perform many other tasks) is pgAdmin, where you can choose your table / function / index and get all the data on the "Statistics" tab in the object browser (main window).
Proceed as described above to see if indexes in use actually speed things up.
If the query planner should choose to use one or more of your indexes but with no effect, or an adverse one, then something is probably wrong with your setup and you need to study the basics of performance optimization: vacuum, analyze, cost parameters, memory usage, ...
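A brief sketch of the logging and usage checks mentioned above (the 500 ms threshold is just an example, and the same settings can be put in postgresql.conf instead of ALTER SYSTEM):
ALTER SYSTEM SET log_min_duration_statement = 500;   -- log statements slower than 500 ms
SELECT pg_reload_conf();
-- after some representative load, look for indexes that never get used
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;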
If you are filtering by more columns, indexes may help, but not much. Also, indexes may not help for small tables.
First, search for "postgresql tuning" - you will find useful information.
If the database can fit in memory - buy enough RAM.
If the database cannot fit in memory - an SSD will help.
If this is not enough and the database is read-only - run 2, 3 or more servers, or partition the database (in the best case so that each partition fits in its server's memory).
Even if the queries are generated, I think they will not be random. Monitor the database for slow queries and improve only those.
I have a question about the fine line between the cost of maintaining an index on a table that grows steadily in size every month and the gain that index gives my queries.
The situation is that I have two tables, Table1 and Table2. Each table grows slowly but regularly each month (about 100 new rows for Table1 and a couple of rows for Table2).
My concrete question is whether to have an index or to drop it. I have measured that a covering index on Table2 improves some of my SELECT queries rather a lot, but again, I have to weigh the pros and cons and am having a really hard time deciding.
For Table1 an index might not be necessary, because the SELECT queries on it are not that common.
I would appreciate any suggestions, tips or just good advice on what a good solution would be.
By the way, I’m using IBM DB2 version 9.7 as my Database system
Sincerely
Mestika
Any additional index will make your inserts slower and your queries faster.
To make an informed decision, you will have to measure exactly by how much, with the amount of data that you expect to see. If you have multiple clients accessing the database at the same time, it may make sense to write a small multithreaded application that simulates the maximum load, both for inserts and for queries.
Your results will depend on the nature of your data and on the hardware that you are running. If you want to know the best answer for your use case, there is no way around testing accurately yourself, with your data and your hardware.
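As a rough sketch of such a test on DB2 (all names are placeholders; "covering" here simply means a multicolumn index containing every column the critical queries touch):
-- time a representative batch of INSERTs and the critical SELECTs without the index first
CREATE INDEX myschema.table2_cov_ix ON myschema.table2 (key_col, col_a, col_b);
-- refresh statistics (RUNSTATS) so the optimizer actually considers the new index,
-- then rerun the same INSERTs and SELECTs and compare timings; dropping it again is cheap:
DROP INDEX myschema.table2_cov_ix;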
Then you will have to ask yourself:
Which query performance do I need?
If the query performance is good enough without the index anyway, easy: Don't add the index!
Which insert performance do I need?
Can it drop below the needed limit with the additional index? If not, easy: Add the index!
If you discover that you absolutely need the index for query performance and you can't get the required insert performance with the index, you may need to buy better hardware. Solid state discs can do wonders for database servers and they are getting affordable.
If your system is running fine for everyone anyway, worry less, let it run as is.