How to set a row limit on a Vert.x SELECT query

I'm using the Vert.x database client and looking for a way to limit the number of rows returned by a SELECT query, without using an SQL-dialect-specific clause (such as TOP 10). The option that comes to mind is something like java.sql.PreparedStatement.setMaxRows().

There is no such option on the Vert.x JDBC client.
Instead, update the SQL query itself to limit the number of rows returned.
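For example, here is a minimal sketch assuming the Vert.x 4 reactive JDBCPool API (vertx-jdbc-client); the JDBC URL, table, and column names are placeholders, and on some databases the limit has to be inlined in the SQL text rather than bound as a parameter:

import io.vertx.core.Vertx;
import io.vertx.jdbcclient.JDBCConnectOptions;
import io.vertx.jdbcclient.JDBCPool;
import io.vertx.sqlclient.PoolOptions;
import io.vertx.sqlclient.Tuple;

public class LimitedSelect {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    JDBCPool pool = JDBCPool.pool(vertx,
        new JDBCConnectOptions().setJdbcUrl("jdbc:h2:mem:test"),
        new PoolOptions().setMaxSize(4));

    // The row limit is part of the SQL text and bound like any other parameter.
    pool.preparedQuery("SELECT id, name FROM users LIMIT ?")
        .execute(Tuple.of(10))
        .onSuccess(rows -> rows.forEach(row -> System.out.println(row.getString("name"))))
        .onFailure(Throwable::printStackTrace)
        .onComplete(ar -> vertx.close());
  }
}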

Related

JPA Specification count with upper limit

I'm using a readonly JPA Entity MyView built on top of a complicated PostgreSQL View my_view that uses Unions of different tables:
@Entity
@Immutable
@Data
public class MyView {
    @Id
    private Long id;
    private String name;
}
I'm also using (Spring Data) JPA Specifications on this Entity to build dynamic queries. Finally I'm using the preexisting query method to execute:
Page<MyView> findAll(Specification<MyView> spec, Pageable pageable);
All of this works perfectly fine - but the View is so complicated it requires a full scan when sorting OR counting. By introducing indices on all relevant tables and columns I got the execution planner to use at least Index Scans when filtering, but Index-Only Scans seem to be prevented by the underlying Unions of different tables.
This means it still needs to load all rows that match the filter criteria into memory and rescan them for sorting or even just counting. If no filter hits and the rows are essentially unfiltered, every page is loaded once, making the whole query run for tens of seconds. If filtering brings the count down to 100,000 or fewer, it's a matter of milliseconds instead.
Further optimization of the query and index structures has diminishing returns at this point, and changing the underlying tables is impractical because of other dependents. Materializing the View or switching anything in the tech stack is also not within scope.
So instead I'd like to change the query coming into the database. Ordering is not really required for the application, so the big question is about the count necessary for pagination.
What would be acceptable is to have a limit for the count - for a few results the exact count may be useful, but if the total count ends up higher than 10000, just saying 'more than 10000 results' in the UI is perfectly fine.
PostgreSQL would allow something like
SELECT COUNT(*) FROM my_view WHERE name LIKE '%abc%'; -- incredibly slow because we need to load gigabytes of data
SELECT COUNT(*) FROM ( SELECT * FROM my_view WHERE name LIKE '%abc%' LIMIT 10000) x; -- incredibly fast because after loading a few pages into memory we reach the limit
Implementing this into custom Spring Data repository methods has turned out to be a harder problem than expected. If the query were static I could send out a native query and be done with it, but Specifications are important.
A few leads in my current research:
introducing a Hibernate StatementInspector may help us inject custom SQL (a rough sketch of this idea follows below)
overriding/extending the PostgreSQL dialect may do the same in a more targeted manner
using query.grouping() and query.having() we may be able to build a different JPQL TypedQuery with the same intention inside the repository implementation: returning the count from something, but with an upper limit. What this query would look like is sadly beyond me.
a non-solution would be to use a projection that only returns the id column, request the first 10000 elements with a normal query, and count the list in Java. While internally this would be fine for the DB, the new bottleneck would be the network. It also seems too crude.
a far more interesting solution would be to use the EXPLAIN SQL feature to get a quick row-count estimate, but implementing this with JPA seems even more complicated than just limiting the count to a reasonable size.
Unfortunately this is also where I currently stand.
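To make the StatementInspector lead more concrete, here is a rough sketch of the kind of rewrite I have in mind (it would be registered via the hibernate.session_factory.statement_inspector property; the 10000 cap, the my_view check, and the naive string handling are all just for illustration):

import org.hibernate.resource.jdbc.spi.StatementInspector;

public class CappedCountInspector implements StatementInspector {

    private static final int CAP = 10_000;

    @Override
    public String inspect(String sql) {
        // Only touch count statements against the problematic view; a real
        // implementation would need a more robust way to recognize them.
        if (sql.startsWith("select count(") && sql.contains("my_view")) {
            // Replace the aggregate with a plain projection and wrap the whole
            // statement so PostgreSQL stops scanning after CAP matching rows.
            String inner = sql.replaceFirst("select count\\([^)]*\\)", "select 1");
            return "select count(*) from (" + inner + " limit " + CAP + ") capped";
        }
        return sql;
    }
}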
What are the best ways to gain this functionality of a JPA count upper limit, and are there existing solutions I'm overlooking?
Similar questions:
JPA count query with maximum results (unclear if the intention is identical, but it sounds similar)
fetch the count of records from table by setting the Maximum results in JPA

Mybatis cursor query more than 100k records

Using MyBatis, I am querying a huge amount of data from the database (about 50k records), but I hit a memory limit and the application restarts. I am currently using List<>; maybe this is the problem.
I am planning to use Cursor<> - can it solve the problem, even if the records grow to above 100k?
Adding a cursor could solve your problem. Another option is batching your data. Is there a field like an id on which you could apply batching?
SELECT TOP(1000) * FROM yourTable WHERE id > {record.id} ORDER BY id
This way, in a loop, you can retrieve a dataset of the size you want, use it for what you want, save the last record.id, and call this query again. Your application will then never run out of memory, even if the number of records in the database increases.
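Here is a rough sketch of that loop, using plain JDBC for brevity (the table and column names are made up; with MyBatis the same query would live in a mapper method taking the last id as a parameter):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class KeysetBatcher {

    private static final int BATCH_SIZE = 1000;

    public static void processAll(Connection conn) throws SQLException {
        // TOP(n) is SQL Server syntax as in the query above; use LIMIT or
        // FETCH FIRST n ROWS ONLY on other databases.
        String sql = "SELECT TOP(" + BATCH_SIZE + ") id, payload FROM yourTable "
                   + "WHERE id > ? ORDER BY id";
        long lastId = 0;
        while (true) {
            int rows = 0;
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, lastId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        lastId = rs.getLong("id");       // remember the last key seen
                        handle(rs.getString("payload")); // process one row at a time
                    }
                }
            }
            if (rows < BATCH_SIZE) {
                break; // last (possibly partial) batch has been processed
            }
        }
    }

    private static void handle(String payload) {
        // placeholder for whatever per-row work the application does
    }
}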

Limiting resource exhaustion on user generated queries on PostgreSQL

I am writing a public-facing query and search API where users can specify multiple conditions. These conditions will be mapped to a complex SQLAlchemy query through various filters and conditions.
What would be the best way to limit the complexity and duration of the query, so that users cannot, by accident or on purpose, cause a denial of service on the system by creating queries that take too long to process? The unsafe conditions would be 1) returning too many items at once, and 2) the resulting query being so complex and slow that it takes forever for the PostgreSQL server to evaluate. The database may return a handful or millions of items depending on the filter.
Is implementing pagination with a fixed-size LIMIT the only way to go, or are there more advanced ways to tackle this?
Can I add a condition on the evaluated SQL expression or the SQLAlchemy session to time out after e.g. 1 second, no matter what the query looks like?
Using PostgreSQL as the database.

Eclipselink batch fetch VS join fetch

When should I use "eclipselink.join-fetch", when should I use "eclipselink.batch" (batch type = IN)?
Are there any limitations for join fetch, such as the number of tables being fetched?
The answer is always specific to your query, the specific use case, and the database, so there is no hard rule on when to use one over the other, or whether to use either at all. You cannot determine what to use unless you are serious about performance and willing to test both under production load conditions - just like any query performance tweaking.
Join-fetch is just what it says, causing all the data to be brought back in the one query. If your goal is to reduce the number of SQL statements, it is perfect. But it comes at a cost, as inner/outer joins, cartesian joins etc. can increase the amount of data being sent across and the work the database has to do.
Batch fetching is one extra query (1+1), and can be done a number of ways. IN collects all the foreign key values and puts them into one statement (more statements if you have more than 1000 values on Oracle). JOIN is similar to join-fetch, as it uses the criteria from the original query to select over the relationship, but it won't return as much data, as it only fetches the required rows. EXISTS is very similar, using a subquery to filter.
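For reference, a minimal sketch of how the two hints are applied on a JPA query, assuming a hypothetical PurchaseOrder/OrderLine model (with recent EclipseLink versions the imports would be jakarta.persistence instead of javax.persistence):

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.OneToMany;

@Entity
class PurchaseOrder {
    @Id Long id;
    @OneToMany List<OrderLine> lines;
}

@Entity
class OrderLine {
    @Id Long id;
}

public class FetchHintExamples {

    // join-fetch: the orders and their lines come back in a single joined query
    static List<PurchaseOrder> withJoinFetch(EntityManager em) {
        return em.createQuery("SELECT o FROM PurchaseOrder o", PurchaseOrder.class)
                 .setHint("eclipselink.join-fetch", "o.lines")
                 .getResultList();
    }

    // batch fetch (type IN): one extra query loads the lines for all returned
    // orders using an IN (...) list of the collected foreign keys
    static List<PurchaseOrder> withBatchFetch(EntityManager em) {
        return em.createQuery("SELECT o FROM PurchaseOrder o", PurchaseOrder.class)
                 .setHint("eclipselink.batch", "o.lines")
                 .setHint("eclipselink.batch.type", "IN")
                 .getResultList();
    }
}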

Fetching millions of records from cassandra using spark in scala performance issue

I have tried a single-node cluster and a 3-node cluster on my local machine to fetch 2.5 million entries from Cassandra using Spark, but in both scenarios it takes 30 seconds just for a SELECT COUNT(*) from the table. I need this and other similar counts for real-time analytics.
SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()
Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If there are, for example, 10 petabytes of data, this query would require reading 10 petabytes off disk, bringing it into memory, and streaming it to the coordinator, which resolves the tombstones/deduplication (you can't just have each replica send a count or you will massively under/over count it) and increments a counter. This is not going to work within a 5-second timeout. You can use aggregation functions over smaller chunks of the data, but not in a single query.
If you really want to make this work, query the system.size_estimates table of each node, and split each range according to its size such that you get an approximate maximum of, say, 5k rows per read. Then issue a COUNT(*) with a TOKEN restriction for each of the split ranges and combine the values of all those queries. This is how the Spark connector does its full table scans in the SELECT * RDDs, so you just have to replicate that.
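A rough sketch of that per-range counting, using the DataStax Java driver v4 (the keyspace, table, partition key, and the fixed number of splits are made up; a real implementation would size the splits from system.size_estimates and handle the minimum-token edge case):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

public class TokenRangeCount {

    public static long approximateCount(CqlSession session) {
        // Count one token slice at a time instead of scanning the whole table at once.
        PreparedStatement ps = session.prepare(
            "SELECT COUNT(*) FROM ks.data WHERE token(id) > ? AND token(id) <= ?");

        int splits = 256;                            // assumed; derive from size_estimates
        long width = (Long.MAX_VALUE / splits) * 2;  // roughly the Murmur3 token range / splits
        long start = Long.MIN_VALUE;
        long total = 0;

        for (int i = 0; i < splits; i++) {
            long end = (i == splits - 1) ? Long.MAX_VALUE : start + width;
            Row row = session.execute(ps.bind(start, end)).one();
            total += row.getLong(0);
            start = end;
        }
        return total;
    }
}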
The easiest, and probably safer and more accurate (but less efficient), option is to use Spark to just read the entire data set and then count it, not using an aggregation function.
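For example, a minimal sketch using the Spark Cassandra connector's DataFrame source (keyspace and table names assumed):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FullScanCount {
    public static long count(SparkSession spark) {
        Dataset<Row> df = spark.read()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "ks")   // assumed keyspace
            .option("table", "data")    // assumed table
            .load();
        // The count happens on the Spark side after reading the partitions,
        // rather than pushing an aggregate down to Cassandra.
        return df.count();
    }
}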
How long does it take to run this query directly, without Spark? I think it is not possible to parallelize COUNT queries, so you won't benefit from using Spark for such queries.