Imagine I have a collection of books in Core Data. Each book can have multiple authors. Here's what this imaginary collection of books could look like (it only has one book, just to simplify things):
[
Book(authors: [
Author(first: "John", last: "Monarch"),
Author(first: "Sarah", last: "Monarch")
])
]
I want to filter my collection of books down to only those that have an author whose name is "Sarah Monarch".
From what I've seen so far, if I wanted to write an NSPredicate to filter my collection and return a filtered collection that only contains this book, I could do it by using a SUBQUERY:
NSPredicate(format: "SUBQUERY(authors, $author, $author.#first == 'Sarah' && $author.#last == 'Monarch').#count > 0")
My understanding is that this operation is essentially the same as:
books.filter {
book in
let matchingAuthors = book.authors.filter {
author in
author.first == "John" && author.last == "Monarch"
}
return matchingAuthors.count > 0
}
My problem is that it seems there's some inefficiency here — SUBQUERY (and the example code above) will look at all authors, when we can stop after finding just one that matches. My intuition would lead me to try a predicate like:
ANY (authors.#first == "Sarah" && authors.#last == "Monarch")
Which, as code, could be:
books.filter {
book in
return book.authors.contains {
author in
author.first == "John" && author.last == "Monarch"
}
}
But this predicate's syntax isn't valid.
If I'm right, and the SUBQUERY-based approach is less efficient (because it looks at all elements in a collection, rather than just stopping at the first match) is there a more correct and efficient way to do this?
The following is an alternative approach, which does not use SUBQUERY, but should ultimately have the same results. I've no idea whether this would in practice be more or less efficient than using SUBQUERY.
Your specific concern is the inefficiency of counting all the matching Authors, rather than just stopping when the first matching Author is found. Bear in mind that there is a lot going on behind the scenes whenever a fetch request is processed. First, CoreData has to parse your predicate and turn it into an equivalent SQLite query, which will look something like this:
SELECT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 WHERE (SELECT COUNT(t1.Z_PK) FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND t1.ZLAST == 'Monarch')) ) > 0
(The precise query will depend on whether the inverse relationship is defined and if so whether it is one-many or many-many; the above is based on the relationship having a to-one inverse, book).
When SQLite is handed that query to execute, it will check what indexes it has available and then invoke the query planner to determine how best to process it. The bit of interest is the subselect to get the count:
SELECT COUNT(t1.Z_PK) FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND t1.ZLAST == 'Monarch'))
This is the bit of the query which corresponds to your first code snippet. Note that it is in SQLite terms a correlated subquery: it includes a parameter (t0.Z_PK - this is essentially the relevant Book) from the outer SELECT statement. SQLite will search the entire table of Authors, looking first to see whether they are related to that Book and then to see whether the author first and last names match. Your proposition is that this is inefficient; the nested select can stop as soon as any matching Author is found. In SQLite terms, that would correspond to a query like this:
SELECT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 WHERE EXISTS(SELECT 1 FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND t1.ZLAST == 'Monarch')) )
(It's unclear from the SQLite docs whether the EXISTS operator actually shortcuts the underlying subselect, but this answer elsewhere on SO suggests it does. If not it might be necessary to add "LIMIT 1" to the subselect to get it to stop after one row is returned). The problem is, I don't know of any way to craft a CoreData predicate which would be converted into a SQLite query using the EXISTS operator. Certainly it's not listed as a function in the NSExpression documentation, nor is it mentioned (or listed as a reserved word) in the predicate format documentation. Likewise, I don't know of any way to add "LIMIT 1" to the subquery (though it's relatively straightforward to add to the main fetch request using fetchLimit.
So not much scope for addressing the issue you identify. However, there might be other inefficiencies. Scanning the Author table for each Book in turn (the correlated subquery) might be one. Might it be more efficient to scan the Author table once, identify those that meet the relevant criteria (first = "Sarah" and last = "Monarch"), then use that (presumably much shorter) list to search for the books? As I said at the start, that's an open question: I have no idea whether it is or isn't more efficient.
To pass results of one fetch request to another, use NSFetchRequestExpression. It's a bit arcane but hopefully the following code is clear enough:
let authorFetch = Author.fetchRequest()
authorFetch.predicate = NSPredicate(format: "#first == 'Sarah' AND #last == 'Monarch'")
authorFetch.resultType = .managedObjectIDResultType
let contextExp = NSExpression(forConstantValue: self.managedObjectContext)
let fetchExp = NSExpression(forConstantValue: authorFetch)
let fre = NSFetchRequestExpression.expression(forFetch: fetchExp, context: contextExp, countOnly: false)
let bookFetch = Book.fetchRequest()
bookFetch.predicate = NSPredicate(format: "ANY authors IN %#", fre)
let results = try! self.managedObjectContext!.fetch(bookFetch)
The result of using that fetch is a SQL query like this:
SELECT DISTINCT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 JOIN ZAUTHOR t1 ON t0.Z_PK = t1.ZBOOK WHERE t1.Z_PK IN (SELECT n1_t0.Z_PK FROM ZAUHTOR n1_t0 WHERE ( n1_t0.ZFIST == 'Sarah' AND n1_t0.ZLAST == 'Monarch')
This has its own complexities (a DISTINCT, a JOIN, and a subselect) but importantly the subselect is no longer correlated: it is independent of the outer SELECT so can be evaluated once rather than being re-evaluated for each row of the outer SELECT.
Related
So I work with Postgres SQL, and I have a jsonb column with the following structure:
{
"Store1":[
{
"price":5.99,
"seller":"seller"
},
{
"price":56.43,
"seller":"seller"
}
],
"Store2":[
{
"price":45.65,
"seller":"seller"
},
{
"price":44.66,
"seller":"seller"
}
]
}
I have a jsonb like this for every product in the database. I want to run an SQL query that will answer the following question:
For each product, is one of the prices in this JSON is bigger/equal/smaller than X?
Basically filter the product to include only the ones who have at least one price that satisfies a mathematical condition.
How can I do it efficiently? What's the best way in Postgres to iterate a JSON like this, with a relatively complex inner structure?
Also, if I could control the way the data is structured (to an extent, I can), what changes can I do to make this query more efficient?
Thanks!
Use a json path expression:
WHERE col ## '$.*[*].price < 20'
or
WHERE col #? '$.*[*] ? (#.price < 20)'
If you need to compare to another column or make the query parameterised, you can either build the jsonpath dynamically
WHERE col ## format('$.*[*].price < %s', $1)::jsonpath
WHERE col #? format('$.*[*] ? (#.price < %s)', $1)::jsonpath
or you can use the respective function and pass variables as an object:
WHERE jsonb_path_match(col, '$.*[*].price < $limit', jsonb_build_object('limit', $1))
WHERE jsonb_path_exists(col, format('$.*[*] ? (#.price < $limit)', jsonb_build_object('limit', $1))
I admit I had to check my cheat sheet to figure out the right combination of operator and expression. Takeaways:
if a comparison operator needs to work with multiple values, it generally functions as an ANY
## does not work with ? (# …) filter expressions since they don't return a boolean,
#? does not work with predicates since they always return a value (even if it's false)
What changes can I do to make this query more efficient?
As #jjanes commented on my other answer, the jsonpath match col ## '$.*[*].price < $limit' isn't going to be fast and needs to do full table scan, at least for < and >. To make a useful index, a different approach is required. An index can only have a single value to compare with, not any number. For that, we need to change the condition from EXISTS(SELECT prices_of(col) WHERE price < $limit) to (SELECT MIN(prices_of(col))) < $limit.
With this idea it is possible to build an expression index on the result of a custom immutable function:
CREATE FUNCTION min_price(data jsonb) RETURNS float
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT
RETURN (
SELECT min((offer ->> 'price')::float)
FROM jsonb_each(data) AS entries(name, store),
LATERAL jsonb_array_elements(store) AS elements(offer)
);
CREATE INDEX example_min_data_price_idx ON example (min_price(data));
which you can use as
SELECT * FROM example WHERE min_price(data) < 20;
Looking for rows with a price larger than a certain number requires a separate index on max_price(data). If you want to use the index in a JOIN with more conditions, consider making it a multi-column index.
Looking for row with a price equalling a certain number can be optimised by indexing the jsonb column and using a jsonpath:
CREATE INDEX example_data_idx ON example USING GIN (data jsonb_ops);
SELECT * FROM example WHERE data ## '$.*[*].price == 20';
SELECT * FROM example WHERE data #? '$.*[*] ? (#.price == 20)';
Unfortunately you can't use jsonb_path_ops here since that doesn't support the wildcard.
I have an INCLUDE table that I want to check a couple of values, in the same row, using an IN clause. The below doesn't return the correct result set because it produces two EXISTS clauses with subqueries. This results in the 2 values being checked independently and not strictly in the same child row. (forgive any typos as I'm typing this in from printed code)
var db = new dbEntities();
IQueryable<dr> query = db.drs;
// filter the parent table
query = query.Where(p => DropDown1.KeyValue.ToString().Contains(p.system_id.ToString()));
// include the child table
query = query.Include(p => p.drs_versions);
// filter the child table using the other two dropdowns
query = query.Where(p => p.drs_versions.Any(c => DropDown2.KeyValue.ToString().Contains(c.version_id.ToString())) && c => DropDown3.KeyValue.ToString().Contains(c.status_id.ToString()));
// I tried removing the second c=> but received an error "'c' is inaccessible due to its protection level" error and couldn't find an clear answer to how this related to Entity Framework
// query = query.Where(p => p.drs_versions.Any(c => DropDown2.KeyValue.ToString().Contains(c.version_id.ToString())) && DropDown3.KeyValue.ToString().Contains(c.status_id.ToString()));
This is an example of the query the code above produces...
SELECT *
FROM drs d
LEFT OUTER JOIN drs_versions v ON d.dr_id = v.dr_id
WHERE d.system_id IN (9,8,3)
AND EXISTS (SELECT 1 AS C1
FROM drs_versions sub1
WHERE d.tr_id = sub1.tr_id
AND sub1.version_id IN (9, 4, 1))
AND EXISTS (SELECT 1 AS C1
FROM drs_versions sub2
WHERE d.tr_id = sub2.tr_id
AND sub2.status_id IN (12, 7))
This is the query I actually want:
SELECT *
FROM drs d
LEFT OUTER JOIN drs_versions v ON d.dr_id = v.dr_id
WHERE d.system_id IN (9, 8, 3)
AND v.version_id IN (9, 4, 1)
AND v.status_id IN (12, 7)
How do I get Entity Framework to create a query that will give me the desired result set?
Thank you for your help
I'd drop all of the .ToString() everywhere and format your values ahead of the query to make it a lot easier to follow.. If EF is generating SQL anything like what you transcribed, you are casting to String just to have EF revert it back to the appropriate type.
From that it just looks like your parenthesis are a bit out of place:
I'm also not sure how something like DropDown2.KeyBalue.ToString() resolves back to what I'd expect to be a collection of numbers based on your SQL examples... I've just substituted this with a method called getSelectedIds().
IEnumerable<int> versions = getSelectedIds(DropDown2);
IEnumerable<int> statuses = getSelectedIds(DropDown3);
query = query
.Where(p => p.drs_versions
.Any(c => versions.Contains(c.version_id)
&& statuses.Contains(c.status_id));
As a general bit of advice I suggest always looking to simplify the variables you want to use in a linq expression as much as possible ahead of time to keep the text inside the expression as simple to read as possible. (avoiding parenthesis as much as possible) Make liberal use of line breaks and indentation to organize what falls under what, and use the code highlighting to double-check your closing parenthesis that they are closing the opening you expect.
I don't think your first example actually was input correctly as it would result in a compile error as you cannot && c => ... within an Any() block. My guess would be that you have:
query = query.Where(p => p.drs_versions.Any(c => DropDown2.KeyValue.ToString().Contains(c.version_id.ToString())) && p.drs_versions.Any(c => DropDown3.KeyValue.ToString().Contains(c.status_id.ToString()));
Your issue is closing off the inner .Any()
query.Where(p => p.drs_versions.Any(c => DropDown2.KeyValue.Contains(c.version_id))
&& DropDown3.KeyValue.Contains(c.status_id)); //<-- "c" is still outside the single .Any() condition so invalid.
Even then I'm not sure this will fully explain the difference in queries or results. It sounds like you've tried typing across code rather than pasting the actual statements and captured EF queries. It may help to copy the exact statements from the code because it's pretty easy to mistype something when trying to simplify an example only to find out you've accidentally excluded the smoking gun for your issue.
I am currently using EFCore 1.1 (preview release) with SQL Server.
I am doing what I thought was a simple OUTER JOIN between an Order and OrderItem table.
var orders = from order in ctx.Order
join orderItem in ctx.OrderItem
on order.OrderId equals orderItem.OrderId into tmp
from oi in tmp.DefaultIfEmpty()
select new
{
order.OrderDt,
Sku = (oi == null) ? null : oi.Sku,
Qty = (oi == null) ? (int?) null : oi.Qty
};
The actual data returned is correct (I know earlier versions had issues with OUTER JOINS not working at all). However the SQL is horrible and includes every column in Order and OrderItem which is problematic considering one of them is a large XML Blob.
SELECT [order].[OrderId], [order].[OrderStatusTypeId],
[order].[OrderSummary], [order].[OrderTotal], [order].[OrderTypeId],
[order].[ParentFSPId], [order].[ParentOrderId],
[order].[PayPalECToken], [order].[PaymentFailureTypeId] ....
...[orderItem].[OrderId], [orderItem].[OrderItemType], [orderItem].[Qty],
[orderItem].[SKU] FROM [Order] AS [order] LEFT JOIN [OrderItem] AS
[orderItem] ON [order].[OrderId] = [orderItem].[OrderId] ORDER BY
[order].[OrderId]
(There are many more columns not shown here.)
On the other hand - if I make it an INNER JOIN then the SQL is as expected with only the columns in my select clause:
SELECT [order].[OrderDt], [orderItem].[SKU], [orderItem].[Qty] FROM
[Order] AS [order] INNER JOIN [OrderItem] AS [orderItem] ON
[order].[OrderId] = [orderItem].[OrderId]
I tried reverting to EFCore 1.01, but got some horrible nuget package errors and gave up with that.
Not clear whether this is an actual regression issue or an incomplete feature in EFCore. But couldn't find any further information about this elsewhere.
Edit: EFCore 2.1 has addressed a lot of issues with grouping and also N+1 type issues where a separate query is made for every child entity. Very impressed with the performance in fact.
3/14/18 - 2.1 Preview 1 of EFCore isn't recommended because the GROUP BY SQL has some issues when using OrderBy() but it's fixed in nightly builds and Preview 2.
The following applies to EF Core 1.1.0 (release).
Although shouldn't be doing such things, tried several alternative syntax queries (using navigation property instead of manual join, joining subqueries containing anonymous type projection, using let / intermediate Select, using Concat / Union to emulate left join, alternative left join syntax etc.) The result - either the same as in the post, and/or executing more than one query, and/or invalid SQL queries, and/or strange runtime exceptions like IndexOutOfRange, InvalidArgument etc.
What I can say based on tests is that most likely the problem is related to bug(s) (regression, incomplete implementation - does it really matter) in GroupJoin translation. For instance, #7003: Wrong SQL generated for query with group join on a subquery that is not present in the final projection or #6647 - Left Join (GroupJoin) always materializes elements resulting in unnecessary data pulling etc.
Until it get fixed (when?), as a (far from perfect) workaround I could suggest using the alternative left outer join syntax (from a in A from b in B.Where(b = b.Key == a.Key).DefaultIfEmpty()):
var orders = from o in ctx.Order
from oi in ctx.OrderItem.Where(oi => oi.OrderId == o.OrderId).DefaultIfEmpty()
select new
{
OrderDt = o.OrderDt,
Sku = oi.Sku,
Qty = (int?)oi.Qty
};
which produces the following SQL:
SELECT [o].[OrderDt], [t1].[Sku], [t1].[Qty]
FROM [Order] AS [o]
CROSS APPLY (
SELECT [t0].*
FROM (
SELECT NULL AS [empty]
) AS [empty0]
LEFT JOIN (
SELECT [oi0].*
FROM [OrderItem] AS [oi0]
WHERE [oi0].[OrderId] = [o].[OrderId]
) AS [t0] ON 1 = 1
) AS [t1]
As you can see, the projection is ok, but instead of LEFT JOIN it uses strange CROSS APPLY which might introduce another performance issue.
Also note that you have to use casts for value types and nothing for strings when accessing the right joined table as shown above. If you use null checks as in the original query, you'll get ArgumentNullException at runtime (yet another bug).
Using "into" will create a temporary identifier to store the results.
Reference : MDSN: into (C# Reference)
So removing the "into tmp from oi in tmp.DefaultIfEmpty()" will result in the clean sql with the three columns.
var orders = from order in ctx.Order
join orderItem in ctx.OrderItem
on order.OrderId equals orderItem.OrderId
select new
{
order.OrderDt,
Sku = (oi == null) ? null : oi.Sku,
Qty = (oi == null) ? (int?) null : oi.Qty
};
I'm using Flask-SQLAlchemy with PostgreSQL. I have the following two models:
class Course(db.Model):
id = db.Column(db.Integer, primary_key = True )
course_name =db.Column(db.String(120))
course_description = db.Column(db.Text)
course_reviews = db.relationship('Review', backref ='course', lazy ='dynamic')
class Review(db.Model):
__table_args__ = ( db.UniqueConstraint('course_id', 'user_id'), { } )
id = db.Column(db.Integer, primary_key = True )
review_date = db.Column(db.DateTime)#default=db.func.now()
review_comment = db.Column(db.Text)
rating = db.Column(db.SmallInteger)
course_id = db.Column(db.Integer, db.ForeignKey('course.id') )
user_id = db.Column(db.Integer, db.ForeignKey('user.id') )
I want to select the courses that are most reviewed starting with at least two reviews. The following SQLAlchemy query worked fine with SQlite:
most_rated_courses = db.session.query(models.Review, func.count(models.Review.course_id)).group_by(models.Review.course_id).\
having(func.count(models.Review.course_id) >1) \ .order_by(func.count(models.Review.course_id).desc()).all()
But when I switched to PostgreSQL in production it gives me the following error:
ProgrammingError: (ProgrammingError) column "review.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT review.id AS review_id, review.review_date AS review_...
^
'SELECT review.id AS review_id, review.review_date AS review_review_date, review.review_comment AS review_review_comment, review.rating AS review_rating, review.course_id AS review_course_id, review.user_id AS review_user_id, count(review.course_id) AS count_1 \nFROM review GROUP BY review.course_id \nHAVING count(review.course_id) > %(count_2)s ORDER BY count(review.course_id) DESC' {'count_2': 1}
I tried to fix the query by adding models.Review in the GROUP BY clause but it did not work:
most_rated_courses = db.session.query(models.Review, func.count(models.Review.course_id)).group_by(models.Review.course_id).\
having(func.count(models.Review.course_id) >1) \.order_by(func.count(models.Review.course_id).desc()).all()
Can anyone please help me with this issue. Thanks a lot
SQLite and MySQL both have the behavior that they allow a query that has aggregates (like count()) without applying GROUP BY to all other columns - which in terms of standard SQL is invalid, because if more than one row is present in that aggregated group, it has to pick the first one it sees for return, which is essentially random.
So your query for Review basically returns to you the first "Review" row for each distinct course id - like for course id 3, if you had seven "Review" rows, it's just choosing an essentially random "Review" row within the group of "course_id=3". I gather the answer you really want, "Course", is available here because you can take that semi-randomly selected Review object and just call ".course" on it, giving you the correct Course, but this is a backwards way to go.
But once you get on a proper database like Postgresql you need to use correct SQL. The data you need from the "review" table is just the course_id and the count, nothing else, so query just for that (first assume we don't actually need to display the counts, that's in a minute):
most_rated_course_ids = session.query(
Review.course_id,
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
order_by(func.count(Review.course_id).desc()).\
all()
but that's not your Course object - you want to take that list of ids and apply it to the course table. We first need to keep our list of course ids as a SQL construct, instead of loading the data - that is, turn it into a derived table by converting the query into a subquery (change the word .all() to .subquery()):
most_rated_course_id_subquery = session.query(
Review.course_id,
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
order_by(func.count(Review.course_id).desc()).\
subquery()
one simple way to link that to Course is to use an IN:
courses = session.query(Course).filter(
Course.id.in_(most_rated_course_id_subquery)).all()
but that's essentially going to throw away the "ORDER BY" you're looking for and also doesn't give us any nice way of actually reporting on those counts along with the course results. We need to have that count along with our Course so that we can report it and also order by it. For this we use a JOIN from the "course" table to our derived table. SQLAlchemy is smart enough to know to join on the "course_id" foreign key if we just call join():
courses = session.query(Course).join(most_rated_course_id_subquery).all()
then to get at the count, we need to add that to the columns returned by our subquery along with a label so we can refer to it:
most_rated_course_id_subquery = session.query(
Review.course_id,
func.count(Review.course_id).label("count")
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
subquery()
courses = session.query(
Course, most_rated_course_id_subquery.c.count
).join(
most_rated_course_id_subquery
).order_by(
most_rated_course_id_subquery.c.count.desc()
).all()
A great article I like to point out to people about GROUP BY and this kind of query is SQL GROUP BY techniques which points out the common need for the "select from A join to (subquery of B with aggregate/GROUP BY)" pattern.
I am looking at some EF examples and trying to decipher what 'Query Projection' exactly equates to when doing LINQ to Entities or EntitySQL. I believe it is when the query results are filtered and projected into an anonymous type but not 100% sure.
Can someone please define this and maybe provide a small L2E query that uses an example of it?
Projection is when the result of a query is output to a different type than the one queried. Another article defined it as : the process of transforming the results of a query
Projection can be to an anonymous type, but could also be to a concrete type. If you come from a SQL world, it is akin to the columns listed in your SELECT clause.
Example selecting a sub-set of an object into an concrete type:
ParentObj.Select(x=> new ParentSlim { ParentID = x.ParentID, Name = x.Name } );
.
Example merging to object into a 3rd anonymous type:
Note: the select new portion is the projection.
from P in ParentObj.AsQueryable()
join C in ChildObj.AsQueryable() on P.ParentID == C.ParentID
select new { // <-- look ma, i'm projecting!
ParentID = P.ParentID,
Name = P.Name,
SubName = C.Name
RandomDate = DateTime.UtcNow()
}