The following APEX code:
for (List<MyObject__c> list : [select id,some_fields from MyObject__c]) {
for (MyObject__c item : list) {
//do stuff
}
}
is a standard pattern for processing large quantities of data: it automatically breaks a large result set into chunks of 200 records, so each List contains up to 200 objects.
Is there any difference between the above approach, and below:
for (List<MyObject__c> list : Database.query('select...')) {
for (MyObject__c item : list) {
//do stuff
}
}
used when the SOQL needs to be dynamic?
I have been told that the 2nd approach is losing data, but I'm not sure of the details, and the challenge is to work out why, and to prevent data loss.
Where did you read that using Dynamic SOQL will result in data loss in this scenario? This is untrue AFAIK. The static Database.query() method behaves exactly the same as static SOQL, except for a few small differences.
To answer your first question, the main difference between your approaches is that the first is static (the query is fixed), and the second allows you to dynamically define the query, as the name suggests. Using your second approach with Dynamic SOQL will still chunk the result records for you.
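As a rough analogy only (this is Python, not Apex, and it says nothing about how the platform implements the feature), the list form of the for loop behaves like a batching iterator. The helper name chunked and the figure of 450 records are assumptions made up for the illustration:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(records: Iterable[T], size: int = 200) -> Iterator[List[T]]:
    """Yield successive lists holding at most `size` records each."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return  # source exhausted, no empty batch is yielded
        yield batch


# 450 hypothetical records arrive as batches of 200, 200 and 50.
batch_sizes = [len(batch) for batch in chunked(range(450))]
print(batch_sizes)
```

The last batch is usually smaller than 200, so code inside the inner loop should not assume a full batch.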
This difference does lead to some other subtle considerations: Dynamic SOQL doesn't support variable bindings other than basic operations. See Daniel Ballinger's idea for this feature here. I also need to add the necessary security caveat: if you're using Dynamic SOQL, do not construct the query from unsanitized user input, as that would render your application vulnerable to SOQL injection.
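The injection risk is the same one that affects any query built by string concatenation. The sketch below shows it with Python's sqlite3 module rather than SOQL, purely as an illustration; the account table and the hostile input value are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO account (name) VALUES (?)",
    [("Acme",), ("Globex",), ("Initech",)],
)

# Hypothetical hostile value typed into a search box.
user_input = "x' OR '1'='1"

# Unsafe: the input becomes part of the query text and changes its meaning.
unsafe_sql = "SELECT name FROM account WHERE name = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the input is passed as a bound value and is only ever compared as data.
safe_rows = conn.execute(
    "SELECT name FROM account WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The concatenated query returns every account, while the bound version returns none, because no account is literally named "x' OR '1'='1".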
Also, I'm just curious, what is your use case/how large of quantities of data are you talking about here? It might be better to use batch Apex, depending on your requirement.
Quite frequently, I'd like to retrieve only the first N rows from a query, but I don't know in advance what N will be. For example:
try(var stream = sql.selectFrom(JOB)
.where(JOB.EXECUTE_AFTER.le(clock.instant()))
.orderBy(JOB.PRIORITY.asc())
.forUpdate()
.skipLocked()
.fetchSize(100)
.fetchLazy()) {
// Now run as many jobs as we can in 10s
...
}
Now, without adding an arbitrary LIMIT clause, the PG query planner sometimes decides to do a painfully slow sequential table scan for such queries, AFAIK because it thinks I'm going to fetch every row in the result set. An arbitrary LIMIT kind of works for simple cases like this one, but I don't like it at all because:
the limit's only there to "trick" the query planner into doing the right thing, it's not there because my intent is to fetch at most N rows.
when it gets a little more sophisticated and you have multiple such queries that somehow depend on each other, choosing an N large enough to not break your code can be hard. You don't want to be the next person who has to understand that code.
finding out that the query is unexpectedly slow usually happens in production where your tables contain a few million/billion rows. Totally avoidable if only the DB didn't insist on being smarter than the developer.
I'm getting tired of writing detailed comments that explain why the queries have to look like this-n-that (i.e. explain why the query planner screws up if I don't add this arbitrary limit)
So, how do I tell the query planner that I'm only going to fetch a few rows and getting the first row quickly is the priority here? Can this be achieved using the JDBC API/driver?
(Note: I'm not looking for server configuration tweaks that indirectly influence the query planner, like tinkering with random page costs, nor can I accept a workaround like set enable_seqscan = off)
(Note 2: I'm using jOOQ in the example code for clarity, but under the hood this is just another PreparedStatement using ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY, so AFAIK we can rule out wrong statement modes)
(Note 3: I'm not in autocommit mode)
PostgreSQL is smarter than you think. All you need to do is to set the fetch size of the java.sql.Statement to a value different from 0 using the setFetchSize method. Then the JDBC driver will create a cursor and fetch the result set in chunks. Any query planned that way will be optimized for fast fetching of the first 10% of the data (this is governed by the PostgreSQL parameter cursor_tuple_fraction). Even if the query performs a sequential scan of a table, not all the rows have to be read: reading will stop as soon as no more result rows are fetched.
I have no idea how to use JDBC methods with your ORM, but there should be a way.
In case the JDBC fetchSize() method doesn't suffice as a hint to get the behaviour you want, you could make use of explicit server side cursors like this:
ctx.transaction(c -> {
c.dsl().execute("declare c cursor for {0}", dsl
.selectFrom(JOB)
.where(JOB.EXECUTE_AFTER.le(clock.instant()))
.orderBy(JOB.PRIORITY.asc())
.forUpdate()
.skipLocked()
);
try {
JobRecord r;
while ((r = c.dsl().resultQuery("fetch forward 1 from c")
.coerce(JOB)
.fetchOne()) != null) {
System.out.println(r);
}
}
finally {
c.dsl().execute("close c");
}
});
There's a pending feature request to support the above in the DSL API as well (see #11407), but the above example shows that this can still be done in a type-safe way using plain SQL templating and the ResultQuery::coerce method.
This is a scalability related question.
We want to read some rows from a table and, after processing some of them, stop the query. The stop criterion is data dependent (we do not know in advance how many or which rows we are interested in).
This is scalability sensitive when the number of rows of the table grows far beyond the number of rows we really are interested in.
If we use the standard PQexec, all rows are returned and we are forced to consume them (we have to call PQgetResult until it returns null). So this does not scale.
We are now trying "row by row" reading.
We first used PQsendQuery and PQsetSingleRowMode. However, we still have to call PQgetResult until it returns null.
Our last approach is PQsendQuery and PQsetSingleRowMode, and when we are done we cancel the query as follows:
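/* conn and res are the connection and the last PGresult fetched,
   both declared elsewhere. */
void CloseRowByRow() {
    /* Ask the server to abandon the running query. */
    PGcancel *c = PQgetCancel(conn);
    if (c != NULL) {
        char errbuf[256];
        PQcancel(c, errbuf, sizeof errbuf);
        PQfreeCancel(c);
    }
    /* Drain whatever is still pending; PQclear(NULL) is a no-op. */
    PQclear(res);
    while ((res = PQgetResult(conn)) != NULL) {
        PQclear(res);
    }
}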
This produces some performance benefits but we are wondering if this is the best we can do.
So here comes the question: Is there any other way?
Use DECLARE and FETCH to define and read from a server-side cursor; this is exactly what they are meant for. You would still use the standard APIs; FETCH just lets you retrieve the results in batches of a controlled size. See the examples in the docs for more details.
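The shape of such a loop can be sketched as follows. This is an analogy, not libpq code: it uses Python's sqlite3 module so that it runs without a server, and cursor.fetchmany(n) stands in for FETCH FORWARD n FROM a cursor created with DECLARE. The job table, the helper take_until, and the stop condition are all assumptions invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, priority INTEGER)")
conn.executemany(
    "INSERT INTO job (id, priority) VALUES (?, ?)",
    [(i, i % 10) for i in range(1, 1001)],
)


def take_until(cursor, batch_size, should_stop):
    """Read rows in batches and stop requesting more once should_stop(row) is true.

    Returns the rows accepted before the stop and the number of batches requested.
    """
    taken = []
    batches = 0
    while True:
        batch = cursor.fetchmany(batch_size)  # analogue of FETCH FORWARD n
        batches += 1
        if not batch:
            return taken, batches  # result set exhausted
        for row in batch:
            if should_stop(row):
                return taken, batches  # data-dependent stop
            taken.append(row)


cur = conn.execute("SELECT id, priority FROM job ORDER BY id")
rows, batches = take_until(cur, 50, lambda row: row[0] > 120)
cur.close()  # analogue of CLOSE: the remaining rows are never transferred
print(len(rows), batches)
```

Only three batches of 50 are requested for a 1000-row table. With PostgreSQL the same loop would run inside a transaction, since a cursor declared without WITH HOLD only lives until the transaction ends.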
I have a cq5 component that needs to query a given path for a couple of other component types like this:
String query = "select * from nt:unstructured where jcr:path like '/content/some/path/%' and ( contains(sling:resourceType, 'resourceType1') or contains(sling:resourceType, 'resourceType2')) ";
Iterator<Resource> resources = resourceResolver.findResources( query,"sql");
Unfortunately, if it is working through a path with a lot of content, the page times out. Is there any way to optimize a function like this, or any tips on improving performance?
1. Use some more specific JCR type than nt:unstructured.
I guess you are looking for page nodes, so try cq:Page or (even better) cq:PageContent.
2. Denormalize your data.
If I understand your query correctly, it should return pages containing resource1 or resource2. Rather than using the contains() predicate, which is very costly and prevents JCR from using an index, mark pages containing these resources with an additional attribute. E.g., set the jcr:content/containsResource1 and jcr:content/containsResource2 properties appropriately and then use them in your query:
select * from cq:PageContent where (containsResource1 is not null or containsResource2 is not null) and jcr:path like '/content/some/path/%'
You could use EventHandler or SlingPostProcessor to set the properties automatically when the resource1 or resource2 is added.
I have added "jackrabbit" and "jcr" tags to your question - I'm not an expert in JCR queries but one of those experts might want to comment on the query statement that you are using and if and how that can be optimized.
That being said, your "page times out" statement seems to imply that it's the client browser that is timing out as it does not receive data for too long. I would first check (with a debugger or log statements) if it's really the findResources call that takes too long, or if it's code that runs after that that's the culprit.
If findResources is slow you'll need to optimize the query or redesign your code to make it asynchronous, for example have the client code first get the HTML page and then get the query results via asynchronous calls.
If code that runs after findResources causes the timeout, you might redesign it to start sending data to the browser as soon as possible, and flush the output regularly to avoid timeouts. But if you're finding lots of results that might take too long for the user anyway and a more asynchronous behavior would then be needed as well.
In the EF implementation for our brand-new project, we have a GetAll method in the repository. But as our application grows, we are going to have, let's say, 5000 products, and getting them all the way we do now would be a performance problem, wouldn't it?
If so, what is a good implementation to tackle this future problem?
Any help will be highly appreciated.
It could become a performance problem if GetAll is enumerating the collection, thus causing the entire data set to be returned in an IEnumerable. This all depends on the number of joins, the size of the data set, whether you have lazy loading enabled, and how SQL Server performs, just to name a few factors.
Although some people will say this is not desirable, you could have GetAll() return an IQueryable, which would defer the query until something caused the collection to be filled by going to SQL. That way you could filter the results with other Where(), Take(), Skip(), etc. statements to allow for paging without having to retrieve all 5000+ products from the database.
It depends on how your repository class is set up. If you're performing the query immediately, i.e. if your GetAll() method returns something like IEnumerable<T> or IList<T>, then yes, that could easily be a performance problem, and you should generally avoid that sort of thing unless you really want to load all records at once.
On the other hand, if your GetAll() method returns an IQueryable<T>, then there may not be a problem at all, depending on whether you trust the people writing queries. Returning an IQueryable<T> would allow callers to further refine the search criteria before the SQL code is actually generated. Performance-wise, it would only be a problem if developers using your code didn't apply any filters before executing the query. If you trust them enough to give them enough rope to hang themselves (and potentially take your database performance down with them), then just returning IQueryable<T> might be good enough.
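To make the idea of deferred execution concrete, here is a minimal sketch in Python rather than C#. It is not how Entity Framework is implemented. LazyQuery, its methods, and the product table are hypothetical names chosen for the illustration. The object only records filters and paging, and SQL runs when the result is iterated:

```python
import sqlite3


class LazyQuery:
    """Collects filters and paging; no SQL runs until the object is iterated."""

    def __init__(self, conn, table, where=(), params=(), limit=None, offset=0):
        self._conn, self._table = conn, table  # table name is trusted, not user input
        self._where, self._params = tuple(where), tuple(params)
        self._limit, self._offset = limit, offset

    def _copy(self, **changes):
        state = dict(where=self._where, params=self._params,
                     limit=self._limit, offset=self._offset)
        state.update(changes)
        return LazyQuery(self._conn, self._table, **state)

    def where(self, condition, *params):
        return self._copy(where=self._where + (condition,),
                          params=self._params + params)

    def skip(self, n):
        return self._copy(offset=n)

    def take(self, n):
        return self._copy(limit=n)

    def sql(self):
        text = f"SELECT id, name FROM {self._table}"
        if self._where:
            text += " WHERE " + " AND ".join(self._where)
        text += " ORDER BY id"
        if self._limit is not None:
            text += f" LIMIT {int(self._limit)} OFFSET {int(self._offset)}"
        elif self._offset:
            text += f" LIMIT -1 OFFSET {int(self._offset)}"  # SQLite needs a LIMIT
        return text

    def __iter__(self):
        # The only place where the database is actually queried.
        return iter(self._conn.execute(self.sql(), self._params))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO product (id, name, price) VALUES (?, ?, ?)",
    [(i, f"product-{i}", float(i)) for i in range(1, 5001)],
)

query = LazyQuery(conn, "product").where("price > ?", 100).skip(10).take(10)
page = list(query)  # the query runs here, and only 10 rows are transferred
print(query.sql())
```

A caller who iterates LazyQuery(conn, "product") without any filter still loads all 5000 rows, which is exactly the trust issue described above.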
If you don't trust them, then, as others have pointed out, you could limit the number of records returned by your query by using the Skip() and Take() extension methods to implement simple pagination, but note that it's possible for records to slip through the cracks if people make changes to the database before you move on to the next page. Making pagination work seamlessly with an ever-changing database is much harder than a lot of people think.
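The "slipping through the cracks" problem is easy to reproduce. The sketch below uses Python with sqlite3 instead of EF, and the product table and both helper functions are invented for the example. It also shows keyset pagination (filtering on the last id seen), a different technique from Skip()/Take() that is named here only as one possible alternative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO product (id) VALUES (?)", [(i,) for i in range(1, 11)])


def offset_page(page_index, size):
    """Skip/Take style paging: position-based."""
    sql = "SELECT id FROM product ORDER BY id LIMIT ? OFFSET ?"
    return [r[0] for r in conn.execute(sql, (size, page_index * size))]


def keyset_page(last_seen_id, size):
    """Keyset paging: continue after the last id the caller has seen."""
    sql = "SELECT id FROM product WHERE id > ? ORDER BY id LIMIT ?"
    return [r[0] for r in conn.execute(sql, (last_seen_id, size))]


first = offset_page(0, 5)  # ids 1..5

# Someone deletes a row from the first page before the second page is requested.
conn.execute("DELETE FROM product WHERE id = 2")

second_by_offset = offset_page(1, 5)          # id 6 has shifted onto page one
second_by_keyset = keyset_page(first[-1], 5)  # resumes right after id 5

print(first, second_by_offset, second_by_keyset)
```

With offset paging the reader never sees id 6. Keyset paging avoids that particular gap, but it needs a stable, unique sort key and does not support jumping to an arbitrary page number.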
Another approach would be to replace your GetAll() method with one that requires the caller to apply a filter before returning results:
public IQueryable<T> GetMatching<T>(Expression<Func<T, bool>> filter)
{
// Replace myQuery with Context.Set<T>() or whatever you're using for data access
return myQuery.Where(filter);
}
and then use it like var results = GetMatching(x => x.Name == "foo");, or whatever you want to do. Note that this could be easily bypassed by calling GetMatching(x => true), but at least it makes the intention clear. You could also combine this with the first method to put a firm cap on the number of records returned.
My personal feeling, though, is that all of these ways of limiting queries are just insurance against bad developers getting their hands on your application, and if you have bad developers working on your project, they'll find a way to cause problems no matter what you try to do. So my vote is to just return an IQueryable<T> and trust that it will be used responsibly. If people abuse it, take away the GetAll() method and give them training-wheels methods like GetRecentPosts(int count) or GetPostsWithTag(string tag, int count) or something like that, where the query logic is out of their hands.
One way to improve this is by using pagination:
context.Products.Skip(n).Take(m);
What you're referring to is known as paging, and it's pretty trivial to do using LINQ via the Skip/Take methods.
EF queries are evaluated lazily, which means they won't actually hit the database until they are enumerated, so the following would pull only 10 rows (rows 11 to 20) after skipping the first 10. In practice you would normally sort the query first (e.g. with OrderBy) so that the pages are deterministic:
context.Table.Skip(10).Take(10);
I am using DBIx::Class and have a resultset which needs to be filtered by data which cannot be generated by SQL. What I need to do is something effectively equivalent to this hypothetical example:
my $resultset = $schema->resultset('Service')->search(\%search);
my $new_resultset = $resultset->filter( sub {
my $web_service = shift;
return $web_service->is_available;
} );
Reading through the docs gives me no clue how to accomplish a strategy like this.
You can’t really, due to the goals for which DBIC result sets are designed:
They compile down to SQL and run a single query, which they do no earlier than when you ask for results.
They are composable.
Allowing filtering by code that runs on the Perl side would make it extremely hairy to achieve those properties, and would hide the fact that such result sets actually run N queries when composed.
Why do you want this, anyway? Why is simply retrieving the results and filtering them yourself insufficient?
Encapsulation? (E.g. hiding the filtering logic in your business logic layer but kicking off the query in the display logic layer.) Then write a custom ResultSet subclass that has an accessor that runs the query and does the desired filtering.
Overhead? (E.g. you will reject most results so you don’t want the overhead of creating objects for them.) Then use HashRefInflator.
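The two suggestions above can be sketched together. This is Python, not Perl, and it is not DBIx::Class API. ServiceResultSet, its methods, the service table, and the availability check are all hypothetical. The object stays composable until an accessor runs the query, rows come back as plain dicts (roughly what HashRefInflator gives you), and the filtering that SQL cannot express happens in code:

```python
import sqlite3


class ServiceResultSet:
    """Composable until an accessor runs the query and filters in code."""

    def __init__(self, conn, min_id=0):
        self._conn = conn
        self._min_id = min_id

    def search(self, min_id):
        # Returns a new, still unexecuted result set.
        return ServiceResultSet(self._conn, max(self._min_id, min_id))

    def rows(self):
        # Plain dicts instead of row objects keep per-row overhead low.
        cur = self._conn.execute(
            "SELECT id, name, url FROM service WHERE id >= ? ORDER BY id",
            (self._min_id,),
        )
        columns = [d[0] for d in cur.description]
        for values in cur:
            yield dict(zip(columns, values))

    def available(self, is_available):
        # The accessor: one query, then filtering that SQL cannot do.
        return [row for row in self.rows() if is_available(row)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE service (id INTEGER PRIMARY KEY, name TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO service (id, name, url) VALUES (?, ?, ?)",
    [
        (1, "search", "https://search.example"),
        (2, "legacy", "http://legacy.example"),
        (3, "billing", "https://billing.example"),
    ],
)


def is_available(row):
    # Stand-in for a real availability check such as a network probe.
    return row["url"].startswith("https://")


available = ServiceResultSet(conn).search(2).available(is_available)
print([row["name"] for row in available])
```

The result of available() is an ordinary list, so it cannot be chained further, which matches the trade-off described in the answer.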
If you filter the results and end up with a list of rows you can create a new resultset like this: http://search.cpan.org/~abraxxa/DBIx-Class-0.08127/lib/DBIx/Class/Manual/Cookbook.pod#Creating_a_result_set_from_a_set_of_rows.
This may keep things consistent in keeping the results as a resultset but I imagine you would not be able to chain it or use any other resultset methods on it.