Is it correct to scan a table in MySQL using "SELECT * ... LIMIT start, count" without an ORDER BY clause?

Suppose table X has 100 tuples.
Will the following approach to scanning X generate all the tuples in TABLE X, in MySQL?
for start in [0, 10, 20, ..., 90]:
    print results of "select * from X LIMIT start, 10;"
I ask, because I've been using PostgreSQL, which clearly says that this approach need not work, but there seems to be no such info for MySQL. If it won't, is there a way to return results in a fixed ordering without knowing any other info about the table (like what the primary key fields are)?
I need to scan each tuple in a table in an application, and I want a way to do it without using too much memory in the application (so simply doing a "select * from X" is out).

No, that isn't a safe assumption. Without an ORDER BY clause, there is no guarantee that the rows come back in the same order on each query, so successive LIMIT windows may repeat some rows and skip others. If the table is properly indexed, adding an ORDER BY on the indexed column(s) shouldn't be too expensive.
Edit: Non-ORDER BYed results will sometimes be in the order of the clustered index, but I wouldn't put any money on that!
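As a rough illustration of paging with a fixed ordering, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the column names `id` and `payload` are assumptions made up for the example; the point is only that every page query carries the same ORDER BY.

```python
import sqlite3


def scan_in_pages(conn, page_size=10):
    """Yield every row of X, fetching one ORDER BY-stabilised page at a time."""
    start = 0
    while True:
        rows = conn.execute(
            # The ORDER BY makes each LIMIT window a slice of the same sequence.
            "SELECT id, payload FROM X ORDER BY id LIMIT ? OFFSET ?",
            (page_size, start),
        ).fetchall()
        if not rows:
            return
        yield from rows
        start += page_size


# Hypothetical demo table with 100 rows (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO X (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 101)],
)

all_rows = list(scan_in_pages(conn))
print(len(all_rows))
```

Only one page of rows is held in memory at a time, at the cost of one query per page.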

If you are using InnoDB or MyISAM table types, a better approach is to use the HANDLER interface. Only MySQL supports this, but it does what you want:
http://dev.mysql.com/doc/refman/5.0/en/handler.html
Also, the MySQL API supports two modes of retrieving data from the server:
store result: in this mode, as soon as a query is executed, the API retrieves the entire result set before returning to the user code. This can use up a lot of client memory buffering results, but minimises the use of resources on the server.
use result: in this mode, the API pulls results row-by-row and returns control to the user code more frequently. This minimises the use of memory on the client, but can hold locks on the server for longer.
Most of the MySQL APIs for various languages support this in one form or another. It is usually an argument that can be supplied when creating the connection, and/or a separate call that can be used against an existing connection to switch it to that mode.
So, in answer to your question - I would do the following:
set the connection to "use result" mode;
select * from X
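A minimal sketch of the row-by-row consumption pattern, written against the generic Python DB-API. It uses sqlite3 as a stand-in so that it runs without a server; the helper name `stream_rows` and the demo table are assumptions. With a MySQL driver, the "use result" behaviour would come from asking for an unbuffered cursor (for example `pymysql.cursors.SSCursor`), and the consuming loop would look the same.

```python
import sqlite3


def stream_rows(conn, sql, batch_size=1000):
    """Yield rows of a query while holding at most batch_size rows in memory."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        while True:
            batch = cur.fetchmany(batch_size)
            if not batch:
                break
            yield from batch
    finally:
        # Release the cursor even if the caller stops iterating early.
        cur.close()


# Hypothetical demo table (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO X (payload) VALUES (?)", [(f"row-{i}",) for i in range(100)]
)

seen = sum(1 for _ in stream_rows(conn, "SELECT * FROM X", batch_size=10))
print(seen)
```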

Related

Statistics of all/many tables in FileMaker

I'm writing a kind of summary page for my FileMaker solution.
For this, I have defined a "statistics" table, which uses formula fields with ExecuteSQL to gather info from most tables, such as number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds when I have a total of about 20k records in about 10 tables. The same SQL on any database system shouldn't take more than some fractions of a second.
What could the reason be, what can I do about it and where can I start debugging to figure out what's causing all this time?
The actual code looks like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema. That is "usually", as you have found.
Can you use the DDR (Database Design Report) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception will also give you many of the stats you are looking for. Again, it depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistic table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple things that will affect how ExecuteSQL performs.
One thing to keep in mind, whether ExecuteSQL, a Perform Find, or relationship, it's all the same basic query under-the-hood. So if it would be slow doing it one way, it's going to likely be slow with any other directly related approach.
Taking these one at a time:
All records count.
Placing an unstored calc in the target table allows you to get the count of the records through the relationship, without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. Super light way to get that info vs using Count which requires FileMaker to touch every record on the other side.
Sum of Records Matching a Value.
Using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then having a Summary field in the target table to sum the records may prove to be more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. (Just don't show that field on any of your layouts if you don't need it.)
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet-spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
Also note, if you have an open record ( that means you as the current user ), FileMaker Server doesn't know what data you have on the client side, and so it sends ALL of the records. That's why I asked if you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.

Sphinx / Manticore - base one plain index off another?

I have a plain text index that pulls data from MySQL and inserts it into Manticore in the format I need (e.g. converting datetime strings to timestamps, CONCATing some fields, etc.).
I then want to create a second plain text index based off this data to group it further. This will save me having to either re-run the normalisation that's done to the first index on INSERT or make it easier for me to query in the future.
For example, my first index is a list of all phone calls that have been made / received (telephone number, duration, agent). The second index should group by Year-Month-Date in such a way that I can see how many calls each agent made on that day. This means I end up with idx_phone_calls and idx_phone_calls_by_date.
Currently, I generate the first index from MySQL, then get Manticore to query itself (by setting the MySQL host to localhost). It works, but it feels as though I should be able to query Manticore directly from within the index. However, I'm struggling to find out whether that's possible.
Is there a better way to do it?
Well, Sphinx/Manticore has its own GROUP BY function. So maybe you can just run the final query against the original index anyway and avoid the need for the second index.
Sphinx's aggregation is (in some ways) more powerful than MySQL's, and can do some 'super aggregation' functions (like with WITHIN GROUP ORDER BY).
But otherwise there is no direct way to create an index off another (e.g. there is no CREATE TABLE idx_phone_calls_by_date SELECT ... FROM idx_phone_calls ... ).
Your 'solution' of directing indexer to query the data from searchd is good. In general this should be pretty efficient; particularly on localhost, there is little overhead. It also maintains the logical separation of searchd being for queries and indexer being for, well, building indexes.

where column in (single value) performance

I am writing dynamic SQL code and it would be easier to use a generic where column in (<comma-separated values>) clause, even when the clause might have 1 term (it will never have 0).
So, does this query:
select * from table where column in (value1)
have any different performance than
select * from table where column=value1
?
All my tests result in the same execution plans, but if there is some knowledge/documentation that sets it in stone, it would be helpful.
This might not hold true for each and every RDBMS, nor for each and every query with its specific circumstances.
The engine will translate WHERE id IN(1,2,3) to WHERE id=1 OR id=2 OR id=3.
So your two ways to articulate the predicate will (probably) lead to exactly the same interpretation.
As always: We should not really bother about the way the engine "thinks". This was done pretty well by the developers :-) We tell - through a statement - what we want to get and not how we want to get this.
Some more details here, especially the first part.
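For the dynamic SQL side of the question, a small helper can emit the same IN form regardless of how many values there are. The sketch below is only illustrative: the helper name `in_clause`, the table `t` and its columns are assumptions, and sqlite3 stands in for whatever engine is actually used. It checks that the single-value IN form returns the same rows as the = form; it says nothing about execution plans on other engines.

```python
import sqlite3


def in_clause(column, values):
    """Build 'column IN (?, ?, ...)' plus its parameter list.

    `column` must come from trusted code, since it is interpolated into the SQL.
    """
    values = list(values)
    if not values:
        raise ValueError("IN list must contain at least one value")
    placeholders = ", ".join("?" for _ in values)
    return f"{column} IN ({placeholders})", values


# Hypothetical demo table (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

clause, params = in_clause("id", [1])
via_in = conn.execute(f"SELECT * FROM t WHERE {clause}", params).fetchall()
via_eq = conn.execute("SELECT * FROM t WHERE id = ?", (1,)).fetchall()
print(clause, via_in == via_eq)
```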
I think this will depend on the platform you are using (the optimizer of the given SQL engine).
I did a little test using MySQL Server:
When I query select * from table where id = 1; I get 1 total, query took 0.0043 seconds
When I query select * from table where id IN (1); I get 1 total, query took 0.0039 seconds
I know this depends on the server, the PC and so on, but the results are very close.
Also keep in mind that IN with a list of constant values is sargable (search-argument-able), just like =, so both forms can use an index on the column to resolve the query.
If you want to know which one is best, you should test them in your own environment, because both work well.

How to optimize generic SQL to retrieve DDL information

I have a generic code that is used to retrieve DDL information from a Firebird database (FB2.1). It generates SQL code like
SELECT * FROM MyTable where 'c' <> 'c'
I cannot change this code. Actually, if that matters, it is inside Report Builder 10.
The fact is that some tables in my database are becoming a little too populated (>1M records) and that query is starting to take too long to execute.
If I try to execute
SELECT * FROM MyTable where SomeIndexedField = SomeImpossibleValue
it will obviously use that index and run very quickly.
Well, it wouldn't be that hard for the database to find out that this condition can never match, make some sort of optimization, and avoid testing it against each row.
Is there any way to make my firebird database to optimize that search?
Because the filter condition is a negated comparison (and doesn't refer to a column to search, but only compares one value to another), Firebird needs to do a full table scan (without using any index) to confirm that no record meets your criteria.
If you can't change the query, you will need to wait for the upcoming 3.0 version, which will implement the Boolean data type and therefore should start to evaluate "constant" fake comparisons in advance (maybe the client library will do this evaluation before sending the statement to the server?).

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual paging by number of results.
And I'm now faced with the question of whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is a function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is that SQL Server has to evaluate the function for every row it examines.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
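A minimal sketch of the trigger-maintained column, using Python's sqlite3 module as a runnable stand-in. The trigger syntax differs in SQL Server, and the table name `people` and the trigger and index names are assumptions for illustration; only `name_first_char_lower` comes from the suggestion above. Note that SQLite's built-in lower() only folds ASCII letters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    name_first_char_lower TEXT          -- redundant, maintained by triggers
);
CREATE INDEX idx_people_first_char ON people (name_first_char_lower);

-- Fill the derived column whenever a row is inserted.
CREATE TRIGGER people_ai AFTER INSERT ON people
BEGIN
    UPDATE people
    SET name_first_char_lower = lower(substr(NEW.name, 1, 1))
    WHERE id = NEW.id;
END;

-- Keep it in sync whenever the name changes.
CREATE TRIGGER people_au AFTER UPDATE OF name ON people
BEGIN
    UPDATE people
    SET name_first_char_lower = lower(substr(NEW.name, 1, 1))
    WHERE id = NEW.id;
END;
""")

conn.executemany(
    "INSERT INTO people (name) VALUES (?)", [("Alice",), ("adam",), ("Bob",)]
)
conn.execute("UPDATE people SET name = 'andy' WHERE name = 'Bob'")

# The page query is now a plain equality test on an indexed column.
names = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM people WHERE name_first_char_lower = ?", ("a",)
    )
)
print(names)
```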
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the left 5. As an aside my overall query does hit some indexes.
I would always suggest using the LIKE operator when the search column is indexed. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)= 'ABA' OR ... up to 9 OR clauses. The count was 7301477 records, taking 4 seconds with LEFT and 1 second with LIKE, i.e. where column_name like 'AAA%' OR Column_Name like 'ABA%' or ... up to 9 LIKE clauses.
Calling a function in the WHERE clause is not a best practice. See http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.startsWith(...) and you'll get just a LIKE function in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework (non core) EntityFunctions so I'm not sure how to do it for EF6.