q.setOrdering(...) extremely slow in DataNucleus + MongoDB

I am using the latest stable DataNucleus (3.0.1) with a MongoDB datastore and the JDO API.
The collection has around 1M documents.
The "id" field is indexed.
This code takes several minutes to execute:
Query q = pm.newQuery(CellMeasurement.class);
q.setOrdering("id descending");
q.setRange(0, count);
Collection<CellMeasurement> result = (Collection<CellMeasurement>)q.execute();
If I remove the q.setOrdering(...) call everything is fine: for count=1000 it takes around a second to load.
It looks like DN does the reordering in memory. Does that make any sense? MongoDB itself orders by this indexed field instantly, and its API supports ordering.
Any ideas? Thanks.

Looking in the log (for any application) would obviously reveal much; here it shows which query is actually performed, and it would tell you easily enough that ordering is not currently implemented to execute in-datastore.
Obviously anybody could contribute to a codebase that has been open source since its first release. The method "compileOrdering" in org.datanucleus.store.mongodb.query.QueryToMongoDBMapper is where you need to implement that; attach a patch to a JIRA issue when you're done. Thx

Related

How to tell PostgreSQL via JDBC that I'm not going to fetch every row of the query result (i.e. how to stream the head of a result set efficiently)?

Quite frequently, I'd like to retrieve only the first N rows from a query, but I don't know in advance what N will be. For example:
try(var stream = sql.selectFrom(JOB)
.where(JOB.EXECUTE_AFTER.le(clock.instant()))
.orderBy(JOB.PRIORITY.asc())
.forUpdate()
.skipLocked()
.fetchSize(100)
.fetchLazy()) {
// Now run as many jobs as we can in 10s
...
}
Now, without adding an arbitrary LIMIT clause, the PG query planner sometimes decides to do a painfully slow sequential table scan for such queries, AFAIK because it thinks I'm going to fetch every row in the result set. An arbitrary LIMIT kind of works for simple cases like this one, but I don't like it at all because:
the limit's only there to "trick" the query planner into doing the right thing, it's not there because my intent is to fetch at most N rows.
when it gets a little more sophisticated and you have multiple such queries that somehow depend on each other, choosing an N large enough to not break your code can be hard. You don't want to be the next person who has to understand that code.
finding out that the query is unexpectedly slow usually happens in production where your tables contain a few million/billion rows. Totally avoidable if only the DB didn't insist on being smarter than the developer.
I'm getting tired of writing detailed comments that explain why the queries have to look like this-n-that (i.e. explain why the query planner screws up if I don't add this arbitrary limit)
So, how do I tell the query planner that I'm only going to fetch a few rows and getting the first row quickly is the priority here? Can this be achieved using the JDBC API/driver?
(Note: I'm not looking for server configuration tweaks that indirectly influence the query planner, like tinkering with random page costs, nor can I accept a workaround like set enable_seqscan=off)
(Note 2: I'm using jOOQ in the example code for clarity, but under the hood this is just another PreparedStatement using ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY, so AFAIK we can rule out wrong statement modes)
(Note 3: I'm not in autocommit mode)
PostgreSQL is smarter than you think. All you need to do is to set the fetch size of the java.sql.Statement to a value different from 0 using the setFetchSize method. Then the JDBC driver will create a cursor and fetch the result set in chunks. Any query planned that way will be optimized for fast fetching of the first 10% of the data (this is governed by the PostgreSQL parameter cursor_tuple_fraction). Even if the query performs a sequential scan of a table, not all the rows have to be read: reading will stop as soon as no more result rows are fetched.
I have no idea how to use JDBC methods with your ORM, but there should be a way.
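For readers who work with plain JDBC rather than through an ORM, the fetch-size advice above might look roughly like the following. This is only a sketch: the class and method names (HeadOfResultSet, processUntilDeadline) are made up for illustration, and it assumes the PostgreSQL JDBC driver's documented conditions for cursor-based fetching (autocommit off, a forward-only result set, and a non-zero fetch size).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Duration;
import java.time.Instant;

final class HeadOfResultSet {

    /**
     * Runs the query with a non-zero fetch size and stops reading once the
     * time budget is spent. Returns the number of rows that were processed.
     * (Hypothetical helper, not part of any library.)
     */
    static int processUntilDeadline(Connection conn, String sql,
                                    int fetchSize, Duration budget) throws SQLException {
        // The driver only fetches in chunks when autocommit is off.
        conn.setAutoCommit(false);
        Instant deadline = Instant.now().plus(budget);
        int processed = 0;
        try (PreparedStatement ps = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Ask for rows in chunks instead of the whole result set at once.
            ps.setFetchSize(fetchSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (Instant.now().isBefore(deadline) && rs.next()) {
                    // Placeholder: handle the current row here (e.g. run the job).
                    processed++;
                }
            }
        }
        // Closing the result set early simply abandons the remaining rows.
        conn.commit();
        return processed;
    }
}
```

With the query from the question this would be called as processUntilDeadline(conn, sql, 100, Duration.ofSeconds(10)); error handling (rollback on failure) is left out to keep the sketch short.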
In case the JDBC fetchSize() method doesn't suffice as a hint to get the behaviour you want, you could make use of explicit server side cursors like this:
ctx.transaction(c -> {
c.dsl().execute("declare c cursor for {0}", dsl
.selectFrom(JOB)
.where(JOB.EXECUTE_AFTER.le(clock.instant()))
.orderBy(JOB.PRIORITY.asc())
.forUpdate()
.skipLocked()
);
try {
JobRecord r;
while ((r = dsl.resultQuery("fetch forward 1 from c")
.coerce(JOB)
.fetchOne()) != null) {
System.out.println(r);
}
}
finally {
c.dsl().execute("close c");
}
});
There's a pending feature request to support the above in the DSL API as well (see #11407), but the example shows that this can already be done in a type-safe way using plain SQL templating and the ResultQuery::coerce method.

Materialize partial set of results with EF Core 2.1

Let's say I have a large collection of tasks stored in a DB and I want to retrieve the latest one according to the requesting user's permissions. The permission-checking logic is complex and not related to the persistence layer, hence I can't put it in an SQL query. What I'm doing today is retrieving ALL tasks from the DB ordered by descending date, then filtering them by permission set and taking the first one. Not a perfect solution: I retrieve thousands of objects when I need only one.
My question is: how can I materialize objects coming from the DB only until I find one that matches my criteria, and discard the rest of the results?
I thought about one solution, but couldn't find information regarding EF Core behavior in this case and don't know how to check it myself:
Build the IQueryable, cast it to IEnumerable, then iterate over it and take the first good task. I know that the IQueryable part will be executed on the server and the IEnumerable part on the client, but I don't know whether all tasks will be materialized before FilterByPermissions is applied, or whether materialization happens on demand. I also don't like the synchronous nature of this solution.
IQueryable<MyTask> allTasksQuery = ...;
IEnumerable<MyTask> allTasksEnumerable = allTasksQuery.AsEnumerable();
IEnumerable<MyTask> filteredTasks = FilterByPermissions(allTasksEnumerable);
MyTask latestTask = filteredTasks.FirstOrDefault();
The workaround could be retrieving small sets of data (pages of 50 for example) until one good task is found but I don't like it.
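The paging workaround described above can be sketched independently of EF Core. The Java sketch below is only an illustration of the idea, not EF Core API: PagedSearch, findFirst and the (offset, limit) page-fetching function are hypothetical names, and the in-memory list in main merely stands in for an ordered database query (in C# the page fetch would presumably be a Skip/Take query over a stable ordering).

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PagedSearch {

    /**
     * Fetches pages of pageSize items until one item passes the check.
     * fetchPage receives (offset, limit) and must return at most `limit`
     * items in a stable order; a short page signals the end of the data.
     */
    static <T> Optional<T> findFirst(BiFunction<Integer, Integer, List<T>> fetchPage,
                                     int pageSize, Predicate<T> allowed) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        for (int offset = 0; ; offset += pageSize) {
            List<T> page = fetchPage.apply(offset, pageSize);
            for (T item : page) {
                if (allowed.test(item)) {
                    return Optional.of(item); // stop: no further pages are loaded
                }
            }
            if (page.size() < pageSize) {
                return Optional.empty(); // last page reached without a match
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the task table, already in the desired order.
        List<Integer> tasks = IntStream.rangeClosed(1, 120).boxed().collect(Collectors.toList());
        BiFunction<Integer, Integer, List<Integer>> fetchPage = (offset, limit) ->
                tasks.subList(Math.min(offset, tasks.size()), Math.min(offset + limit, tasks.size()));
        // Stand-in for the permission check.
        System.out.println(findFirst(fetchPage, 50, id -> id % 70 == 0));
    }
}
```

The trade-off is the one already noted in the question: at most one page of objects is materialized beyond the match, at the cost of one query per page.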

Getting Spring Data MongoDB to recreate indexes

I'd like to be able to purge the database of all data between integration test executions. My first thought was to use an org.springframework.test.context.support.AbstractTestExecutionListener registered using the @TestExecutionListeners annotation to perform the necessary cleanup between tests.
In the afterTestMethod(TestContext testContext) method I tried getting the database from the test context and using the com.mongodb.DB.dropDatabase() method. This worked OK, apart from the fact that it also destroys the indexes that were automatically created by Spring Data when it first bound my managed @Document objects.
For now I have fixed this by resorting to iterating through the collection names and calling remove as follows:
for (String collectionName : database.getCollectionNames()) {
    // Remove all documents but keep the collection itself, and with it the indexes.
    if (collectionIsNotASystemCollection(collectionName)) {
        database.getCollection(collectionName).remove(new BasicDBObject());
    }
}
This works and achieves the desired result - but it'd be nice if there was a way I could simply drop the database and just ask Spring Data to "rebind" and perform the same initialisation that it did when it started up to create all of the necessary indexes. That feels a bit cleaner and safer...
I tried playing around with the org.springframework.data.mongodb.core.mapping.MongoMappingContext but haven't yet managed to work out if there is a way to do what I want.
Can anyone offer any guidance?
See this ticket for an explanation why it currently works as it works and why working around this issue creates more problems than it solves.
Suppose you're working with Hibernate and then trigger a call to delete the database: would you even dream of assuming that the tables and all indexes magically reappear? If you drop a MongoDB database/collection you remove all metadata associated with it. Thus, you need to set it up the way you'd like it to work.
P.S.: I am not sure we did ourselves a favor to add automatic indexing support as this of course triggers the expectations that you now have :). Feel free to comment on the ticket if you have suggestions how this could be achieved without the downsides I outlined in my initial comment.

What is the best way to handle Autocomplete with a million of records efficiently?

I've just started doing some Java EE projects, and right now I'm trying to update my existing autocomplete field. I'm using PrimeFaces for JSF, and I'm using JPA.
My autocomplete is working fine. My problem now is that the records have grown to a million, and with that my current code produces an out-of-memory/heap space problem due to the large List generated.
I generate the List in an EJB annotated with @Startup since the data does not change; it usually only grows when we add more data directly to the database. Also, when I try loading this into a managed bean with @PostConstruct, I run out of memory even faster.
I'm using this to populate my list in the ejb
private List<TranslationAutoComplete> translations;
this.translations = em.createQuery("SELECT NEW com.sample.model.TranslationAutoComplete(t.id, t.entry) FROM Translation t ORDER BY t.entry ASC", TranslationAutoComplete.class).getResultList();
Given that, what would be a better structure for handling this efficiently without running into memory/heap space problems? I've read about Memcache and other non-core Java collections but haven't tried them yet. Are these the better solution? Or is there a more efficient way of doing this in Java EE?
You have to limit the query results. You can achieve this with the setMaxResults(int maxResults) for your Query. If you do this:
this.translations = em
        .createQuery("SELECT NEW com.sample.model.TranslationAutoComplete(t.id, t.entry) "
                + "FROM Translation t ORDER BY t.entry ASC", TranslationAutoComplete.class)
        .setMaxResults(10)
        .getResultList();
You will get only 10 results. If the user keeps inputting data on your autocomplete, then more results will appear, making it a lot more efficient. Cheers.
First, run a profiler such as JProfiler and check which data actually causes the OutOfMemoryError. It may be that other objects, such as a large number of connections, are taking up a lot of memory.
Large data really is a time bomb for a managed bean. I would load only a fixed number of records into the List in your @PostConstruct method initially, and fetch further records in separate service requests as they are needed (if there really are a million records).
You can also increase the heap size of your application server, but that is not a sound way to handle this problem.

MongoDB: What's a good way to get a list of all unique tags?

What's the best way to keep track of unique tags for a collection of documents millions of items large? The normal way of doing tagging seems to be indexing multikeys. I will frequently need to get all the unique keys, though. I don't have access to mongodb's new "distinct" command, either, since my driver, erlmongo, doesn't seem to implement it, yet.
Even if your driver doesn't implement distinct, you can implement it yourself. In JavaScript (sorry, I don't know Erlang, but it should translate pretty directly) you can say:
result = db.$cmd.findOne({"distinct" : "collection_name", "key" : "tags"})
So, that is: you do a findOne on the "$cmd" collection of whatever database you're using. Pass it the collection name and the key you want to run distinct on.
If you ever need a command your driver doesn't provide a helper for, you can look at http://www.mongodb.org/display/DOCS/List+of+Database+Commands for a somewhat complete list of database commands.
I know this is an old question, but I had the same issue and could not find a real solution in PHP for it.
So I came up with this:
http://snipplr.com/view/59334/list-of-keys-used-in-mongodb-collection/
John, you may find it useful to use Variety, an open source tool for analyzing a collection's schema: https://github.com/jamescropcho/variety
Perhaps you could run Variety every N hours in the background, and query the newly-created varietyResults database to retrieve a listing of unique keys which begin with a given string (i.e. are descendants of a specific parent).
Let me know if you have any questions, or need additional advice.
Good luck!