Does ActiveRecord#first always return the record with the minimal ID? - postgresql

Env: Rails 4.2.4, Postgres 9.4.1.0
Is there a guarantee that ActiveRecord#first will always return the record with the minimal ID, and ActiveRecord#last the record with the maximal ID?
I can see from the Rails console that for these two methods the appropriate ORDER BY ... ASC/DESC is added to the generated SQL. But the author of another SO thread, "Rails with Postgres data is returned out of order", reports that the first method returned NOT the first record...
ActiveRecord first:
2.2.3 :001 > Account.first
Account Load (1.3ms) SELECT "accounts".* FROM "accounts" ORDER BY "accounts"."id" ASC LIMIT 1
ActiveRecord last:
2.2.3 :002 > Account.last
Account Load (0.8ms) SELECT "accounts".* FROM "accounts" ORDER BY "accounts"."id" DESC LIMIT 1
==========
ADDED LATER:
So, I did my own investigation (based on D-side's answer), and the answer is NO. Generally speaking, the only guarantee is that the first method will return the first record from a collection. It may, as a side effect, add an ORDER BY primary_key clause to the SQL, but that depends on whether the records have already been loaded into cache/memory or not.
Here are the relevant methods, extracted from Rails 4.2.4:
/activerecord/lib/active_record/relation/finder_methods.rb
# Find the first record (or first N records if a parameter is supplied).
# If no order is defined it will order by primary key.
# ---> NO, IT IS NOT. <--- This comment is WRONG.
def first(limit = nil)
  if limit
    find_nth_with_limit(offset_index, limit)
  else
    find_nth(0, offset_index) # <---- When we get here, `find_nth_with_limit` (the method that adds `ORDER BY`) is triggered only when `loaded?` is false
  end
end

def find_nth(index, offset)
  if loaded?
    @records[index] # <--- Here's the `problem`: the record is simply returned by index, and no `ORDER BY` is applied to the SQL
  else
    offset += index
    @offsets[offset] ||= find_nth_with_limit(offset, 1).first
  end
end
Here are a few examples to make it clear (note that in the console each statement's result is inspected and printed, which loads the relation into memory):
Account.first # True, records are ordered by ID
a = Account.where('free_days > 1') # False, no ordering (the console loads the relation here, unordered)
a.first # False, no ordering, the record is simply returned by @records[index]
Account.where('free_days > 1').first # True, ordered by ID
a = Account.all # False, no ordering (the console loads the relation here, unordered)
a.first # False, no ordering, the record is simply returned by @records[index]
Account.all.first # True, ordered by ID
Now some examples with a has-many relationship:
Account has_many AccountStatuses, AccountStatus belongs_to Account
a = Account.first
a.account_statuses # No ordering
a.account_statuses.first
# Here is the tricky part: sometimes this returns the @records[index] entry, and sometimes it adds ORDER BY id (if the records were not loaded before)
Here is my conclusion:
Treat the first method as returning the first record from an already loaded collection (which may have been loaded in any order, i.e. unordered). If I want to be sure that first returns the record with the minimal ID, then the collection I call first on must be appropriately ordered beforehand.
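A minimal sketch of the distinction (assuming the Account model above; behavior as of Rails 4.2):
accounts = Account.where('free_days > 1') # relation built, not yet loaded
accounts.first            # not loaded: the SQL gets ORDER BY "accounts"."id" ASC LIMIT 1
accounts.load             # force the relation into memory (this SQL has no ORDER BY)
accounts.first            # loaded: returns @records[0], i.e. whatever row Postgres happened to send first
accounts.order(:id).first # deterministic either way: order(:id) builds a fresh, ordered query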
And the Rails documentation for the first method is simply wrong and needs to be rewritten.
http://guides.rubyonrails.org/active_record_querying.html
1.1.3 first
The first method finds the first record ordered by the primary key. <--- No, it does not!

If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on. A particular output ordering can only be guaranteed if the sort step is explicitly chosen.
http://www.postgresql.org/docs/9.4/static/queries-order.html (emphasis mine)
So ActiveRecord actually adds ordering by primary key, whichever that is, to keep the result deterministic. Relevant source code is easy to find using pry, but here are extracts from Rails 4.2.4:
# show-source Thing.all.first
def first(limit = nil)
  if limit
    find_nth_with_limit(offset_index, limit)
  else
    find_nth(0, offset_index)
  end
end

# show-source Thing.all.find_nth
def find_nth(index, offset)
  if loaded?
    @records[index]
  else
    offset += index
    @offsets[offset] ||= find_nth_with_limit(offset, 1).first
  end
end

# show-source Thing.all.find_nth_with_limit
def find_nth_with_limit(offset, limit)
  relation = if order_values.empty? && primary_key
               order(arel_table[primary_key].asc) # <-- ATTENTION
             else
               self
             end
  relation = relation.offset(offset) unless offset.zero?
  relation.limit(limit).to_a
end

It may change depending on your database engine. With MySQL, first always returns the record with the minimal ID, but it does not work the same way for PostgreSQL. I had several issues with this when I was a newbie: my app worked as expected locally with MySQL, but everything was messed up when deployed to Heroku with PostgreSQL. So, to avoid issues with PostgreSQL, always order your records by id before the query:
Account.order(:id).first
The above ensures the minimal ID for MySQL, PostgreSQL, and any other database engine, as you can see in the query:
SELECT `accounts`.* FROM `accounts` ORDER BY `accounts`.`id` ASC LIMIT 1

I don't think the answer you reference is relevant (even to the question it is on), as it refers to non-ordered querying, whereas first and last do apply an order based on id.
In some cases, where you are applying your own group to the query, you cannot use first or last, because an ORDER BY cannot be applied if the grouping does not include id; but you can use take instead to just get the first row returned, as in the sketch below.
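For example (a hedged sketch; the plan_id column is made up for illustration):
# In PostgreSQL, .first would raise here, because ORDER BY "accounts"."id" cannot be added when the grouping does not include id
Account.group(:plan_id).select('plan_id, COUNT(*) AS total').take # take just adds LIMIT 1, no ORDER BY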
There have been versions where first and/or last did not apply the order (one of the late Rails 3 releases on PostgreSQL, as I recall), but those were errors.

Related

Why does `FOR ALL ENTRIES` lower performance of CDS view on DB6?

I'm reading data from an SAP Core Data Service (CDS view, SAP R/3, ABAP 7.50) using a WHERE clause on its primary (and only) key column. There is a massive performance decrease (about a factor of 5) when using FOR ALL ENTRIES:
Reading data using a normal WHERE clause takes about 10 seconds in my case:
SELECT DISTINCT *
  FROM ZMY_CDS_VIEW
  WHERE prim_key_col EQ 'mykey'
  INTO TABLE @DATA(lt_table1).
Reading data using FOR ALL ENTRIES with the same WHERE takes about 50 seconds in my case:
"" boilerplate code that creates a table with one entry holding the same key value as above
TYPES: BEGIN OF t_kv,
key_value like ZMY_CDS_VIEW-prim_key_col,
END OF t_kv.
DATA lt_key_values TYPE TABLE OF t_kv.
DATA ls_key_value TYPE t_kv.
ls_key_value-key_value = 'mykey'.
APPEND ls_key_value TO lt_key_values.
SELECT *
FROM ZMY_CDS_VIEW
FOR ALL ENTRIES IN #lt_key_values
WHERE prim_key_col eq #lt_key_values-key_value
INTO TABLE #DATA(lt_table2).
I do not understand why the same selection takes five times as long when using FOR ALL ENTRIES. Since the table lt_key_values has only one entry, I'd expect the database (sy-dbsys is 'DB6' in my case) to perform exactly the same operations, plus maybe some small negligible overhead ≪ 40 s.
Selecting from the underlying SQL view instead of the CDS (with its Access Control and so on) makes no difference at all; neither does adding or removing the DISTINCT keyword (FOR ALL ENTRIES implies DISTINCT).
A colleague guessed that FOR ALL ENTRIES actually selects the entire content of the CDS and compares it with the internal table lt_key_values at runtime. This seems about right.
Using transaction ST05 I recorded an SQL trace, which looks like the following in the FOR ALL ENTRIES case:
SELECT
DISTINCT "ZMY_UNDERLYING_SQL_VIEW".*
FROM
"ZMY_UNDERLYING_SQL_VIEW",
TABLE( SAPTOOLS.MEMORY_TABLE( CAST( ? AS BLOB( 2G )) ) CARDINALITY 1 ) AS "t_00" ( "C_0" VARCHAR(30) )
WHERE
"ZMY_UNDERLYING_SQL_VIEW"."MANDT" = ?
AND "ZMY_UNDERLYING_SQL_VIEW"."PRIM_KEY_COL" = "t_00"."C_0"
[...]
Variables
A0(IT,13) = ITAB[1x1(20)]
A1(CH,10) = 'mykey'
A2(CH,3) = '100'
So what actually happens is: ABAP selects the entire CDS content and puts the value from the internal table into something like an additional column, then keeps only those rows where the internal-table value and the SQL result entry match. ==> No optimization at the database level => bad performance.
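If it fits the use case, a common way out (a sketch, not from the original thread; it assumes the same lt_key_values table and ABAP 7.40+ syntax) is to convert the entries into a range table and select with IN, which the optimizer can treat like a normal WHERE condition:

" Build a range table from the key values, then use IN instead of FOR ALL ENTRIES
DATA lt_key_range TYPE RANGE OF ZMY_CDS_VIEW-prim_key_col.

lt_key_range = VALUE #( FOR ls_kv IN lt_key_values
                        ( sign = 'I' option = 'EQ' low = ls_kv-key_value ) ).

SELECT *
  FROM ZMY_CDS_VIEW
  WHERE prim_key_col IN @lt_key_range
  INTO TABLE @DATA(lt_table3).

Be aware that a very large key set can exceed the database's statement size limit with IN; avoiding that limit is exactly the problem FOR ALL ENTRIES exists to solve.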

Thinking Sphinx indexing performance

I have a large index definition that takes too long to index. I suspect the main problem is caused by the many LEFT OUTER JOINs generated.
I saw this question, but can't find documentation about using source: :query, which seems to be part of the solution.
My index definition and the resulting query can be found here: https://gist.github.com/jonsgold/fdd7660bf8bc98897612
How can I optimize the generated query to run faster during indexing?
The 'standard' sphinx solution to this would be to use ranged queries.
http://sphinxsearch.com/docs/current.html#ex-ranged-queries
... splitting the query up into lots of small parts, so the database server has a better chance of being able to run it (rather than one huge query).
But I have no idea how to actually enable that in Thinking Sphinx; I can't see anything in the documentation. I could help you edit the sphinx.conf, but I'm also not sure how TS will cope with you manually editing the config file.
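For reference, the raw sphinx.conf shape of a ranged query looks roughly like this (adapted from the Sphinx documentation linked above; the documents table and step size are illustrative):

sql_query_range = SELECT MIN(id), MAX(id) FROM documents
sql_range_step  = 1000
sql_query       = SELECT * FROM documents WHERE id >= $start AND id <= $end

Sphinx fetches the bounds once, then runs sql_query repeatedly with $start/$end substituted for each step-sized window.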
This is the solution that worked best (from the linked question). Basically, you can remove a piece of the main query sql_query and define it separately as a sql_joined_field in the sphinx.conf file.
It's important to add all relevant sql conditions to each sql_joined_field (such as sharding indexes by modulo on the ID). Here's the new definition:
ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: false,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
  indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
  ...
  has
  ...
end

ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: true,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
  indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND incidents.delta = 1 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
  ...
  has
  ...
end
The magic that defines the field site as a separate query is the option source: :query at the end of the line.
Notice the core index definition has the parameter delta?: false, while the delta index definition has the parameter delta?: true. That's so I could use the condition WHERE incidents.delta = 1 in the delta index and filter out irrelevant records.
I found sharding didn't perform any better, so I reverted to one unified index.
See the whole index definition here: https://gist.github.com/jonsgold/05e2aea640320ee9d8b2.
Important to remember!
The Sphinx document ID offset must be handled manually. That is, whenever an index for another model is added or removed, my calculated document ID will change, and the definition must be updated.
So, in my example, if I added an index for a different model (not :incident), I would have to run rake ts:configure to find out my new offset and change incidents.id * 51 + 7 accordingly.

Linq To Entities - Any VS First VS Exists

I am using Entity Framework and I need to check if a product with name = "xyz" exists ...
I think I can use Any(), Exists() or First().
Which one is the best option for this kind of situation? Which one has the best performance?
Thank You,
Miguel
Okay, I wasn't going to weigh in on this, but Diego's answer complicates things enough that I think some additional explanation is in order.
In most cases, .Any() will be faster. Here are some examples.
Workflows.Where(w => w.Activities.Any())
Workflows.Where(w => w.Activities.Any(a => a.Title == "xyz"))
In the above two examples, Entity Framework produces an optimal query. The .Any() call is part of a predicate, and Entity Framework handles this well. However, if we make the result of .Any() part of the result set like this:
Workflows.Select(w => w.Activities.Any(a => a.Title == "xyz"))
... suddenly Entity Framework decides to create two versions of the condition, so the query does up to twice the work it really needs to. However, the following query isn't any better:
Workflows.Select(w => w.Activities.Count(a => a.Title == "xyz") > 0)
Given the above query, Entity Framework will still create two versions of the condition, plus it will also require SQL Server to do an actual count, which means it doesn't get to short-circuit as soon as it finds an item.
But if you're just comparing these two queries:
Activities.Any(a => a.Title == "xyz")
Activities.Count(a => a.Title == "xyz") > 0
... which will be faster? It depends.
The first query produces an inefficient double-condition query, which means it will take up to twice as long as it has to.
The second query forces the database to check every item in the table without short-circuiting, which means it could take up to N times longer than it has to, depending on how many items need to be evaluated before finding a match. Let's assume the table has 10,000 items:
If no item in the table matches the condition, this query will take roughly half the time as the first query.
If the first item in the table matches the condition, this query will take roughly 5,000 times longer than the first query.
If one item in the table is a match, this query will take an average of 2,500 times longer than the first query.
If the query is able to leverage an index on the Title and key columns, this query will take roughly half the time as the first query.
So in summary, IF you are:
Using Entity Framework 6.1 or earlier (since 6.1.1 has a fix to improve the query), AND
Querying directly against the table (as opposed to doing a sub-query), AND
Using the result directly (as opposed to it being part of a predicate), AND
Either:
You have good indexes set up on the table you are querying, OR
You expect the item not to be found the majority of the time
THEN you can expect .Any() to take as much as twice as long as .Count(). For example, a query might take 100 milliseconds instead of 50. Or 10 instead of 5.
IN ANY OTHER CIRCUMSTANCE .Any() should be at least as fast, and possibly orders of magnitude faster than .Count().
Regardless, until you have determined that this is actually the source of poor performance in your product, you should care more about what's easy to understand. .Any() more clearly and concisely states what you are really trying to figure out, so stick with that.
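For the original question, the idiomatic form would be (a hedged sketch, assuming an EF context db with a Products set and a Name column):

// Reads as "is there any product named xyz?" and translates to an EXISTS query
bool exists = db.Products.Any(p => p.Name == "xyz");

// The Count-based alternative discussed above; only occasionally faster, and less clear
bool existsViaCount = db.Products.Count(p => p.Name == "xyz") > 0;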
Any translates into "Exists" at the database level. First translates into "Select Top 1 ...". Of these, Exists will outperform First, because the actual object doesn't need to be fetched, only a Boolean result value.
At least you didn't ask about .Where(x => x.Count() > 0), which requires the entire matching set to be evaluated and iterated over before you can determine that you have one record. Any short-circuits the request and can be significantly faster.
One would think Any() gives better results, because it translates to an EXISTS query... but EF is awfully broken, generating this (edited):
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent1]
WHERE Condition
)) THEN cast(1 as bit) WHEN ( NOT EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent2]
WHERE Condition
)) THEN cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
Instead of:
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent1]
WHERE Condition
)) THEN cast(1 as bit)
ELSE cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
...basically doubling the query cost (for simple queries; it's even worse for complex ones)
I've found that using .Count(condition) > 0 is pretty much always faster (the cost is exactly the same as a properly written EXISTS query).
Ok, I decided to try this out myself. Mind you, I'm using the OracleManagedDataAccess provider with the OracleEntityFramework, but I'm guessing it produces compliant SQL.
I found that First() was faster than Any() for a simple predicate. I'll show the two queries in EF and the SQL that was generated. Mind you, this is a simplified example, but the question was asking whether any, exists or first was faster for a simple predicate.
var any = db.Employees.Any(x => x.LAST_NAME.Equals("Davenski"));
So what does this resolve to in the database?
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS "C1"
FROM "MYSCHEMA"."EMPLOYEES" "Extent1"
WHERE ('Davenski' = "Extent1"."LAST_NAME")
)) THEN 1 ELSE 0 END AS "C1"
FROM ( SELECT 1 FROM DUAL ) "SingleRowTable1"
It's creating a CASE statement. As we know, ANY is merely syntactic sugar: it resolves to an EXISTS query at the database level, just as it does if you use ANY at the database level. But this doesn't seem to be the most optimized SQL for this query.
In the above example, the EF construct Any() isn't needed here and merely complicates the query.
var first = db.Employees.Where(x => x.LAST_NAME.Equals("Davenski")).Select(x=>x.ID).First();
This resolves to in the database as:
SELECT
"Extent1"."ID" AS "ID"
FROM "MYSCHEMA"."EMPLOYEES" "Extent1"
WHERE ('Davenski' = "Extent1"."LAST_NAME") AND (ROWNUM <= (1) )
Now this looks like a more optimized query than the initial query. Why? It's not using a CASE ... THEN statement.
I ran these trivial examples several times, and in ALMOST every case (no pun intended), First() was faster.
In addition, I ran a raw SQL query, thinking this would be faster:
var sql = db.Database.SqlQuery<int>("SELECT ID FROM MYSCHEMA.EMPLOYEES WHERE LAST_NAME = 'Davenski' AND ROWNUM <= (1)").First();
The performance was actually the slowest, but similar to the Any EF construct.
Reflections:
EF's Any doesn't exactly map to how you might use ANY in the database. I could have written a more optimized query in Oracle using ANY, without the CASE ... THEN statement, than what EF produced.
ALWAYS check your generated SQL in a log file or in the debug output window.
If you're going to use ANY, remember it's syntactic sugar for EXISTS. Oracle also uses SOME, which is the same as ANY. You're generally going to use it in the predicate as a replacement for IN. In this case it generates a series of ORs in your WHERE clause. The real power of ANY or EXISTS is when you're using Subqueries and are merely testing for the EXISTENCE of related data.
Here's an example where ANY really makes sense. I'm testing for the EXISTENCE of related data. I don't want to get all of the records from the related table. Here I want to know if there are Surveys that have Comments.
var b = db.Survey.Where(x => x.Comments.Any()).ToList();
This is the generated SQL:
SELECT
"Extent1"."SURVEY_ID" AS "SURVEY_ID",
"Extent1"."SURVEY_DATE" AS "SURVEY_DATE"
FROM "MYSCHEMA"."SURVEY" "Extent1"
WHERE ( EXISTS (SELECT
1 AS "C1"
FROM "MYSCHEMA"."COMMENTS" "Extent2"
WHERE ("Extent1"."SURVEY_ID" = "Extent2"."SURVEY_ID")
))
This is optimized SQL!
I believe the EF does a wonderful job generating SQL. But you have to understand how the EF constructs map to DB constructs else you can create some nasty queries.
And probably the best way to get a count of related data is to do an explicit Load with a Collection Query count. This is far better than the examples provided in prior posts. In this case you're not loading related entities, you're just obtaining a count. Here I'm just trying to find out how many Comments I have for a particular Survey.
var d = db.Survey.Find(1);
var e = db.Entry(d).Collection(f => f.Comments)
.Query()
.Count();
Any() and First() work with IEnumerable<T>, which gives you the flexibility of lazy evaluation. Exists(), however, is only defined on List<T>.
I hope this clears things up for you and helps you decide which one to use.

How to insert autoincremented master/slave records using ScalaQuery?

Classic issue, new framework -- thus the problem.
PostgreSQL + Scala + ScalaQuery. I have a Master table with a serial (autoincrement) id and a Slave table, also with a serial id.
I need to insert one master record and several slaves. I have to do it within a transaction (to have the ability to cancel it all), so I cannot run a query after inserting the master to find out its id. As far as I can see, SQ's insert method does not return any reference to the inserted master record.
So how to do it?
The SQ examples cover this, but without an autoincremented field, so that solution (pre-set ids) is not applicable here.
If I understand it correctly, this is not possible in an automatic way for now. If one is not afraid, it can be done as follows, obtaining the id of the last insert (for each master record insertion):
postgreSQL function for last inserted ID
Then using it in SQ:
http://groups.google.com/group/scalaquery/browse_thread/thread/faa7d3e5842da82e
This code shows the MySql way. I'm posting it to the list for posterity's sake.
// The [Long] type parameter and the quoted function name were mangled by the
// original post's formatting; "LAST_INSERT_ID" (MySQL's last-insert-id
// function) is a plausible reconstruction.
val scopeIdentity = SimpleFunction.nullary[Long]("LAST_INSERT_ID")
val inserted = Actions.insert("cat", "eats", "dog")
// Print out the count of inserted records.
println(inserted)
// Print out the primary key for the last inserted record.
println(Query(scopeIdentity).first)
// Regards // Bryan
But since for autoincremented fields you have to use projections that exclude the autoinc columns, and then insert tuples instead of named record types, the question is whether it isn't better to hold your breath until SQ supports this directly.
Note: I am an SQ newbie, so I might just be misinforming you.
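Putting the pieces together for PostgreSQL, the overall shape would be roughly this (a sketch only: the Masters/Slaves table objects and their forInsert projections are made up, the session/transaction incantation is from memory of the ScalaQuery-era API, and lastval() is PostgreSQL's function for the most recently generated sequence value):

db withSession { implicit session: Session =>
  session.withTransaction {
    // insert the master through a projection that omits the autoinc id column
    Masters.forInsert.insert("master name")

    // ask PostgreSQL which id it just generated
    val masterId = Query(SimpleFunction.nullary[Long]("lastval")).first

    // insert the slaves, each referencing the master's id;
    // an exception anywhere rolls back the whole transaction
    slaveRows.foreach(s => Slaves.forInsert.insert((masterId, s)))
  }
}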

Does DataReader.NextResult always retrieve results in the same order?

I have a SELECT query that yields multiple results and does not have an ORDER BY clause.
If I execute this query multiple times and then iterate through the results using DataReader.NextResult(), would I be guaranteed to get the results in the same order?
For e.g. if I execute the following query that return 199 rows:
SELECT * FROM products WHERE productid < 200
would I always get the first result with productid = 1 and so on?
As far as I have observed it always returns the results in the same order, but I cannot find any documentation of this behavior.
======================================
As per my research:
Check out this blog: Conor vs. SQL. I actually wanted to ask whether the query result can change even if the data in the table remains the same (i.e. no updates or deletes). It seems that for a large table, when SQL Server employs parallelism, the order can differ.
First of all, to iterate the rows in a DataReader, you should call Read, not NextResult.
Calling NextResult will move to the next result set if your query has multiple SELECT statements.
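A minimal sketch of the difference (assuming an open SqlConnection and a SqlCommand whose text contains one or more SELECT statements):

using (var reader = command.ExecuteReader())
{
    while (reader.Read())             // iterates the rows of the current result set
    {
        var id = reader.GetInt32(0);  // read columns of the current row
    }

    if (reader.NextResult())          // only moves on when the batch has a second SELECT
    {
        while (reader.Read()) { /* rows of the second result set */ }
    }
}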
To answer your question, you must not rely on this.
A query without an ORDER BY clause will return rows in SQL Server's default iteration order.
For small tables, this will usually be the order in which the rows were added, but this is not guaranteed and is liable to change at any time. For example, if the table is indexed or partitioned, the order will be different.
No, the DataReader will return the results in the order they come back from SQL. If you don't specify an ORDER BY clause, that will be the order in which they happen to exist in the table.
It is possible, perhaps even likely, that they will always come back in the same order, but this isn't guaranteed. The order is determined by the query plan (at least in SQL Server) on the database server. If something changes that query plan, the order could change. You should always use ORDER BY if the order of results is in any way important to your processing of the data, as in the query below.
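For the query in the question, that just means being explicit:

SELECT * FROM products WHERE productid < 200 ORDER BY productid;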