sqlite3_enable_shared_cache and sqlite3_backup_init slowing execution on iPhone

I have some relatively complex SQLite queries running in my iPhone app, and some are taking way too much time (upwards of 15 seconds). There are only about 430 records in the database. One thing I've noticed is that opening a new database connection (which I only do once) and stepping through query results (with sqlite3_step()) causes sqlite3_backup_init() and sqlite3_enable_shared_cache() to run, which account for 4450 ms and 3720 ms of processing time, respectively, over the test period. I tried calling sqlite3_enable_shared_cache(0); to disable shared caching before opening the database connection, but this seems to have no effect.
Does anyone have any idea how to disable these so I can get some sort of speed improvement?

Well, I suppose this doesn't directly answer the question, but part of the problem was my use of a cross join instead of a left join. That reduced the query time from about 4000ms to 60ms. Also, the backup_init function is no longer called and enable_shared_cache doesn't spin as much.
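The original query isn't shown, so the sketch below only illustrates the kind of rewrite described, with hypothetical table and column names. Two things worth noting: SQLite treats CROSS JOIN as a hint not to reorder the tables, which can hurt the plan, and a left join additionally keeps unmatched rows, so the two forms only return the same rows when every row has a match.
-- before: cross join with the join condition pushed into WHERE
SELECT r.id, c.name
FROM records r CROSS JOIN categories c
WHERE c.id = r.category_id;
-- after: the join condition expressed in the ON clause
SELECT r.id, c.name
FROM records r LEFT JOIN categories c ON c.id = r.category_id;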

I fixed my app by replacing an inner join with a left join; my data allows it.
If your data does not allow it, consider adding a WHERE clause so the left join returns the same rows as the inner join:
SELECT * FROM apples a INNER JOIN bananas b ON b.id = a.id;
vs.
SELECT * FROM apples a LEFT JOIN bananas b ON b.id = a.id WHERE b.id IS NOT NULL;

Related

How can I achieve a paginated join, with the first page returned quickly?

I am looking to join multiple big tables in the OLAP layer to power the UI. Since the tables are really large, the response for each join query takes too long. I want to get results in less than 3 seconds. But the catch is that I don't want the entire joined result at once, because I am only displaying a small subset of it in the UI at any particular point. Only user interaction would require me to show the next subset of the result.
I am looking for a strategy to create a system where I can perform the same join queries, but initially only a small subset is joined and used to power the UI. Meanwhile, the remaining subsets of data are joined in the background and pulled into the UI when required. Is this the right way to approach this problem, where I have to perform really big joins? If so, how can I design such a system?
You can use a WITH HOLD cursor:
START TRANSACTION;
DECLARE c CURSOR WITH HOLD FOR SELECT /* big query */;
FETCH 50 FROM c;
COMMIT;
The COMMIT will take a long time, as it materializes the whole result set, but the FETCH 50 can be reasonably fast (or not, depending on the query).
You can then continue fetching from the cursor. Don't forget to close the cursor when you are done:
CLOSE c;
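Putting it together, a usage sketch of the paging loop, reusing the cursor name from above; each FETCH returns the next slice of the already-materialized result, and FETCH simply returns no rows once the result set is exhausted:
FETCH 50 FROM c;   -- second page
FETCH 50 FROM c;   -- third page, and so on
CLOSE c;           -- release the held result set once the UI no longer needs it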

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translate column names to be consistent, and then to join them up using aj and aj0.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is to have a very complicated-looking single query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables, it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the easier it is for you to debug later and for any future readers/users of your code.
Unless absolutely necessary, I would try to avoid creating 7 copies of the same table. If you are dealing with large tables, memory could quickly become a concern; particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages, e.g.:
res:select from trades;
res:aj[`ccy`arrivalTime;
  res;
  select ccy:currency, arrivalTime:dateTime, fxRate from usdRates];
res:update someFunc fxRate from res;
Sean beat me to it, but an aj for the time after / a reverse aj is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables, unless you are possibly calculating cross rates? In that case I would typically join ccy1 and ccy2 to the same table with 2 ajs and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better, e.g. sym vs symbol.

Multiple joins in Pentaho

I am trying to join 5 flows, in which the first one is the driver and the others are left outer joined to the driver. I have used Ab Initio in the past, where we could use a single join component and specify the kind of join for each input flow. I couldn't find any such Step in Pentaho, and hence I have to rely on the Merge Join, which left outer joins only two tables at a time, and then take its result to join with the next, and so on and so forth. I am planning to do all that in a single transformation.
What I am worried about is that, since Pentaho runs all the Steps in parallel, it might start to run a join that is much later in the flow without waiting for an earlier join to complete. Is this a valid concern? If so, how do you tackle it in a single transformation?
That's correct: you can only join 2 steps at a time.
As for your second point: no, the steps do execute in parallel, but your second join will wait for your first join to finish, so you will get a correct result.

Use cases for lateral that do not involve a set returning function

I was reading this post the other day:
http://blog.heapanalytics.com/postgresqls-powerful-new-join-type-lateral/
I suspected some of the claims in the post may not have been accurate. This one in particular:
"Without lateral joins, we would need to resort to PL/pgSQL to do this analysis. Or, if our data set were small, we could get away with complex, inefficient queries."
The sum(1) and order by time limit 1 approach seemed less than ideal to me, and I thought this analysis could be done with normal left joins instead of lateral left joins. So I came up with a proof of concept:
https://github.com/ajw0100/snippets/tree/master/SQL/lateral
Is my conclusion in the README correct? Does anything beyond select...from...where in a laterally joined subquery force a nested loop? In that case, is lateral really only useful with set-returning functions, as the docs suggest? Does anyone know of any use cases for lateral that do not involve a set-returning function?
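For context, a minimal sketch of the "order by time limit 1" pattern under discussion, written as a lateral join; the blog post's actual query isn't reproduced here, and the table and column names are hypothetical. The laterally joined subquery is correlated with the outer row, so it is evaluated once per outer row, which is why such plans are typically nested loops:
SELECT u.id, first_event.event_time
FROM users u
LEFT JOIN LATERAL (
    SELECT e.event_time
    FROM events e
    WHERE e.user_id = u.id
    ORDER BY e.event_time
    LIMIT 1
) first_event ON true;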

What are the pitfalls of setting enable_nestloop to OFF

I have a query in my application that runs very fast when there is a large number of rows in my tables. But when the number of rows is a moderate size (neither large nor small), the same query runs as much as 15 times slower.
The explain plan shows that the query on a medium-sized data set is using nested loops for its join algorithm. The large data set uses hash joins.
I can discourage the query planner from using nested loops either at the database level (postgresql.conf) or per session (SET enable_nestloop TO off).
What are the potential pitfalls of setting enable_nestloop to off?
Other info: PostgreSQL 8.2.6, running on Windows.
What are the potential pitfalls of setting enable_nestloop to off?
This means that you will never be able to use indexes efficiently.
And it seems that you don't use them now.
The query like this:
SELECT u.name, p.name
FROM users u
JOIN profiles p ON p.id = u.profile_id
WHERE u.id = :id
will most probably use NESTED LOOPS with an INDEX SCAN on users.id and an INDEX SCAN on profiles.id, provided that you have built indices on these fields.
Queries with low selectivity filters (that is, queries that need more than 10% of data from tables they use) will benefit from MERGE JOINS and HASH JOINS.
But queries like the one given above require NESTED LOOPS to run efficiently.
If you post your queries and table definitions here, much can probably be done about the indexes and query performance.
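To make that concrete, here is a minimal sketch with hypothetical table definitions; the primary keys provide the indexes mentioned above, and prefixing the query with EXPLAIN shows whether the planner actually picks the nested-loop-plus-index-scan plan:
CREATE TABLE profiles (id integer PRIMARY KEY, name text);
CREATE TABLE users (id integer PRIMARY KEY, name text, profile_id integer REFERENCES profiles(id));

EXPLAIN
SELECT u.name, p.name
FROM users u
JOIN profiles p ON p.id = u.profile_id
WHERE u.id = 1;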
A few things to consider before taking such drastic measures:
upgrade your installation to the latest 8.2.x (which right now is 8.2.12). Even better - consider upgrading to the next stable version which is 8.3 (8.3.6).
consider changing your production platform to something other than Windows. The Windows port of PostgreSQL, although very useful for development purpose, is still not on a par with the Un*x ones.
read the first paragraph of "Planner Method Configuration". This wiki page probably will help too.
I have the exact same experience. Some queries on a large database were executed using nested loops and took 12 hours, when the same queries run in 30 seconds after turning off nested loops or removing the indexes.
Having hints would be really nice here, but I tried
...
SET ENABLE_NESTLOOP TO FALSE;
... critical query
SET ENABLE_NESTLOOP TO TRUE;
...
to deal with this matter. So you can definitely disable and re-enable nested loop use, and you can't argue with a 9000-fold speed increase :)
One problem I have is making the ENABLE_NESTLOOP change from inside a PL/pgSQL procedure. I can run an SQL script in Aqua Data Studio that does everything right, but when I put the same logic in a PL/pgSQL procedure it still takes 12 hours; apparently the change is being ignored.
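For reference, a minimal sketch of the kind of procedure being described; the function name, the placeholder query, and the table names are hypothetical, and this is the pattern reported above as not taking effect:
CREATE OR REPLACE FUNCTION run_critical_query() RETURNS void AS $$
BEGIN
    SET enable_nestloop TO off;   -- same toggle as in the script version
    -- stand-in for the critical query
    PERFORM count(*) FROM big_table_a a JOIN big_table_b b ON b.id = a.b_id;
    SET enable_nestloop TO on;    -- restore the default afterwards
END;
$$ LANGUAGE plpgsql;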