I've been doing some research on how to set up a new GraphQL API project, but I'm running into some basic conceptual problems trying to figure out how to do pagination and nested database queries efficiently.
I'd appreciate any pointers or advice!
Let's say we get a GraphQL query like so:
articles(limit: 10) {
  title
  content
  comments(limit: 5) {
    postedAt
    text
  }
}
A typical ORM, assuming eager loading of the nested type, could translate this kind of query into an SQL query like the one below, and then loop over the results to manually group the comments together and hydrate it all.
select a.title, a.content, c.posted_at, c.text
from articles as a
left join comments as c on c.article_id = a.id
limit ???
But so far, I've only ever seen ORMs like Doctrine (PHP) and Sequelize (JS) fail at doing pagination correctly in these cases. They can't handle page sizes correctly, because the limit applies to the joined rows rather than to the articles or to the comments per article, so there's no way to express either limit in this query's setup.
=> Am I correct in seeing this problem? Or am I missing something crucial: are ORMs able to do pagination with eagerly loaded data somehow?
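(From what I can tell, ORMs can at least handle the parent limit by pushing it into a subquery, or by issuing a separate query for the parent rows first, something like the sketch below. But the per-article comment limit still has nowhere to go.)
-- parent limit pushed into a subquery (sketch);
-- the "5 comments per article" limit still can't be expressed here
select a.title, a.content, c.posted_at, c.text
from (select * from articles limit 10) as a
left join comments as c on c.article_id = a.id;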
I just recently came across the lateral join type in Postgres, which seems to solve this issue, provided we also add some JSON trickery:
select a.title, a.content, t.data as comments
from articles as a
join lateral (
    select json_agg(sub.*) as data
    from (
        select c.posted_at, c.text
        from comments as c
        where c.article_id = a.id
        limit 5
    ) sub
) t on true
limit 20;
(I think I've seen this kind of lateral + JSON trickery in how Hasura and PostGraphile transform to SQL, so I don't think it's unwarranted / bad engineering.)
=> Is there any ORM out there (except Hasura/PostGraphile), possibly Postgres-specific, that uses this kind of lateral + JSON approach instead of the typical method described above?
Lastly, my research has taught me that in building a GraphQL API, you'll typically find yourself data-loading (batching) nested queries instead of eager-loading them from the "parent" query. So, for example, this would be without data-loading:
class ArticleResolver {
  comments(article) {
    return db.query("select ... from comments where ... = {article.id}");
  }
}
and then this would be with data-loading:
class ArticleResolver {
  commentsDataLoader = new DataLoader(articleIds => {
    // NB: DataLoader expects one result per key, in key order, so the
    // flat rows would still need to be grouped by article id here
    return db.query("select ... from comments where ... in {articleIds}");
  });
  comments(article) {
    return this.commentsDataLoader.load(article.id);
  }
}
But as soon as you want to start adding parameters like limit: 5 to nested queries, this data-loading query becomes as complicated as the one from the original question, so we're back where we started :)
=> Is there a conventional way, or some standard practice, for dealing with this setup? Is there any known way / library to easily write out resolvers like, for example, this:
class ArticleResolver {
  ...
  comments(article, limit) {
    return db.somehowMagicallyDataloaded("select * from comments ... = {article.id} limit {limit}");
  }
}
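(The closest thing I can imagine is a loader that batches on the (articleId, limit) pair and reuses the lateral trick from above, so a whole batch of ids becomes one round trip. A Postgres-specific sketch, with table/column names assumed:)
-- one query for a whole batch of article ids; :article_ids would be
-- the batched keys passed as an array, :limit the shared per-article limit
select a.id as article_id, t.data as comments
from unnest(:article_ids::bigint[]) as a(id)
join lateral (
    select json_agg(sub.*) as data
    from (
        select c.posted_at, c.text
        from comments as c
        where c.article_id = a.id
        order by c.posted_at desc  -- makes the limit deterministic
        limit :limit
    ) sub
) t on true;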
Related
I wonder why Entity Framework is generating such an inefficient SQL query. In my code I expected the WHERE to act upon the INCLUDE:
db.Employment.Where(x => x.Active).Include(x => x.Employee).Where(x => x.Employee.UserID == UserID)
but I ended up with a double SQL JOIN:
SELECT [x].[ID], [x].[Active], [x].[CurrencyID], [x].[DepartmentID], [x].[EmplEnd], [x].[EmplStart], [x].[EmployeeID], [x].[HolidayGroupID], [x].[HourlyCost], [x].[JobTitle], [x].[ManagerID], [x].[WorkScheduleGroupID], [e].[ID], [e].[Active], [e].[Address], [e].[BirthDate], [e].[CitizenshipID], [e].[City], [e].[CountryID], [e].[Email], [e].[FirstName], [e].[Gender], [e].[LastName], [e].[Note], [e].[Phone], [e].[PostalCode], [e].[TaxNumber], [e].[UserID]
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [x.Employee] ON [x].[EmployeeID] = [x.Employee].[ID]
INNER JOIN [Employee] AS [e] ON [x].[EmployeeID] = [e].[ID]
WHERE ([x].[Active] = 1) AND ([x.Employee].[UserID] = @__UserID_0)
I found out that this query will create better SQL:
db.Employment.Where(x => x.Active).Where(x => x.Employee.UserID == UserID)
SELECT [x].[ID], [x].[Active], [x].[CurrencyID], [x].[DepartmentID], [x].[EmplEnd], [x].[EmplStart], [x].[EmployeeID], [x].[HolidayGroupID], [x].[HourlyCost], [x].[JobTitle], [x].[ManagerID], [x].[WorkScheduleGroupID]
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [x.Employee] ON [x].[EmployeeID] = [x.Employee].[ID]
WHERE ([x].[Active] = 1) AND ([x.Employee].[UserID] = @__UserID_0)
However, the problem here is that the referenced entities are not retrieved from the DB.
Why don't the two statements produce the same SQL?
The SQL is different because the statements are different.
Entity Framework does produce inefficient T-SQL; it always has. By abstracting away the subtleties needed for well-performing SQL and replacing them with "belt and braces" alternatives that nearly always work, you sacrifice performance for utility.
If you need good performance, write the SQL yourself. Dapper works well for me. You can't realistically expect a "one size fits all" solution to come up with the best code for your specific situation. You can do this across the board or just where you need to.
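(For the query above, the hand-written version is straightforward: one join, both entities' columns, filtered once. A sketch, with the column lists trimmed:)
SELECT [x].[ID], [x].[Active], [x].[EmployeeID], -- ...rest of the Employment columns
       [e].[ID], [e].[FirstName], [e].[LastName], [e].[UserID] -- ...rest of the Employee columns
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [e] ON [x].[EmployeeID] = [e].[ID]
WHERE [x].[Active] = 1 AND [e].[UserID] = @UserID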
Unless you have high volume or specific performance requirements, get on with it and use whatever you find easiest. If you need to tune your queries to your database, you are going to have to learn the details of your database engine and implement the queries yourself. If you are expecting the next iteration of Entity Framework to be the magic bullet that allows fast, efficient SQL data access with minimal knowledge, good luck.
P.S.
Off-topic, but NoSQL probably isn't the answer either; it's just a different class of database.
I spent quite some time coding multiple SQL queries that were formerly used to fetch the data for various R scripts. This is how it worked:
sqlContent = readSQLFile("file1.sql")
sqlContent = setSQLVariables(sqlContent, variables)
results = executeSQL(sqlContent)
The catch is that for some queries, a result from a prior query is required, which is why creating VIEWs in the database itself does not solve the problem. With Spark 2.0 I already figured out a way to do just that, through:
// create a dataframe using a jdbc connection to the database
val tableDf = spark.read.jdbc(...)
var tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase
var sqlQuery = Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
sqlQuery = sqlQuery.replace("OLD_TABLE_NAME",tempTableName)
tableDf.createOrReplaceTempView(tempTableName)
var data = spark.sql(sqlQuery)
But this is, in my humble opinion, very fiddly. Also, more complex queries, e.g. queries that incorporate subquery factoring, currently don't work. Is there a more robust way, like re-implementing the SQL code as Spark SQL code using .filter($""), .select($""), etc.?
The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the results of one former SQL query (which always involves a few JOINs, WITHs, etc.). So n queries lead to n DataFrames.
Is there a better option than the provided two?
Setup: Hadoop v2.7.3, Spark 2.0.0, IntelliJ IDEA 2016.2, Scala 2.11.8, test cluster on a Win7 workstation
It's not especially clear what your requirement is, but I think you're saying you have queries something like:
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM (SELECT * FROM people LEFT OUTER JOIN places ON ...) WHERE age>20
and you would want to declare and execute this efficiently as
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM <cachedresult> WHERE age>20
To achieve that, I would enhance the input file so each SQL statement has an associated table name into which the result will be stored.
e.g.
PEOPLEPLACES\tSELECT * FROM people LEFT OUTER JOIN places ON ...
ADULTS\tSELECT * FROM PEOPLEPLACES WHERE age>20
Then execute them in a loop, like:
parseSqlFile().foreach { case (name, query) =>
  val data: DataFrame = execute(query)
  data.createOrReplaceTempView(name)
}
Make sure you declare the queries in order, so all required tables have been created. Otherwise, do a little more parsing and sort by dependencies.
In an RDBMS I'd call these tables materialised views, i.e. a transform on other data, like a view, but with the result cached for later reuse.
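(For reference, the database-side equivalent in Postgres would be something like the sketch below; column names and the join condition are assumed for illustration:)
-- a view whose result is cached, like the temp tables above
create materialized view peopleplaces as
select p.id, p.name, p.age, pl.name as place_name  -- columns assumed
from people as p
left outer join places as pl on pl.id = p.place_id; -- join condition assumed

-- re-run the underlying query when the source data changes
refresh materialized view peopleplaces;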
I have a custom query along these lines. I get the list of orderIds from outside. I have the entire order object list with me, so I can change the query in any way, if needed.
#Query("SELECT p FROM Person p INNER JOIN p.orders o WHERE o.orderId in :orderIds)")
public List<Person> findByOrderIds(#Param("orderIds") List<String> orderIds);
This query works fine, but sometimes it may have anywhere between 50-1000 entries in the orderIds list sent from the outside function, so it becomes very slow, taking as much as 5-6 seconds, which is not fast enough. My question is: is there a better, faster way to do this? When I googled, and on this site, I see we can use ANY or EXISTS: Postgresql: alternative to WHERE IN respective WHERE NOT IN, or create a temporary table: https://dba.stackexchange.com/questions/12607/ways-to-speed-up-in-queries-under-postgresql, or join this to a VALUES clause: Alternative when IN clause is inputed A LOT of values (postgreSQL). All these answers are tailored towards direct SQL calls, nothing based on JPA. The ANY keyword is not supported by spring-data. Not sure about creating temporary tables in custom queries. I think I can do it with native queries, but have not tried it. I am using spring-data + OpenJPA + PostgreSQL.
Can you please suggest a solution or give pointers? I apologize if I missed anything.
thanks,
Alice
You can use WHERE EXISTS instead of an IN clause in a native SQL query, as well as in HQL in JPA, which can bring significant performance benefits. Please see the sample below.
Sample JPA Query:
SELECT emp FROM Employee emp JOIN emp.projects p where NOT EXISTS (SELECT project from Project project where p = project AND project.status <> 'Active')
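(Applied to the Person/orders case from the question, a native-query version might look like the following sketch. Table and column names are assumed, and the = ANY form takes a Postgres array parameter instead of a long IN list:)
-- names assumed; pass :orderIds as an array parameter
select p.*
from person p
where exists (
    select 1
    from orders o
    where o.person_id = p.id
      and o.order_id = any(:orderIds)
);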
I'm using QueryDSL with JPA.
I want to query some properties of an entity, like this:
QPost post = QPost.post;
JPAQuery q = new JPAQuery(em);
List<Object[]> rows = q.from(post).where(...).list(post.id, post.name);
It works fine.
If I want to query a relation property, e.g. the comments of a post:
List<Set<Comment>> rows = q.from(post).where(...).list(post.comments);
It's also fine.
But when I want to query relation and simple properties together, e.g.
List<Object[]> rows = q.from(post).where(...).list(post.id, post.name, post.comments);
Then something goes wrong, generating bad SQL syntax.
Then I realized that it's not possible to query them together in one SQL statement.
Is it possible that QueryDSL would somehow deal with relations and generate additional queries (just like what Hibernate does with lazy relations), and load the results in?
Or should I just query twice, and then merge both result lists?
P.S. What I actually want is each post with its comments' ids, so a function to concat each post's comment ids would be better. Is this kind of expression possible?
q.list(post.id, post.name, post.comments.all().id.join())
and generate subquery SQL like (select group_concat(c.id) from comments as c where c.post_id = post.id)
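(Spelled out in full, the SQL I have in mind is roughly the sketch below; group_concat is the MySQL spelling, string_agg the Postgres one:)
select post.id, post.name,
       (select group_concat(c.id)
        from comments as c
        where c.post_id = post.id) as comment_ids
from post;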
Querydsl JPA is restricted to the expressivity of JPQL, so what you are asking for is not possible with Querydsl JPA. You can, though, try to express it with Querydsl SQL; it should be possible there. Also, as you don't project entities, but literals and collections, it might work just fine.
Alternatively you can load the Posts with only the Comment ids loaded and then project the id, name and comment ids to something else. This should work when accessors are annotated.
The simplest thing would be to query for Posts and use fetchJoin for comments, but I'm assuming that's too slow for your use case.
I think you ought to simply project required properties of posts and comments and group the results by hand (if required). E.g.
QPost post = ...;
QComment comment = ...;
List<Tuple> rows = q.from(post)
    // or leftJoin if you also want posts without comments
    .innerJoin(comment).on(comment.postId.eq(post.id))
    .orderBy(post.id.asc()) // could be used to optimize the grouping
    .list(new QTuple(post.id, post.name, comment.id));

Map<Long, PostWithComments> results = ...;
for (Tuple row : rows) {
    PostWithComments res = results.get(row.get(post.id));
    if (res == null) {
        res = new PostWithComments(row.get(post.id), row.get(post.name));
        results.put(res.getPostId(), res);
    }
    res.addCommentId(row.get(comment.id));
}
NOTE: You cannot use limit or offset with this kind of query.
As an alternative, it might be possible to tune your mappings so that 1) Comments are always lazy proxies, so that (with property access) Comment.getId() is possible without initializing the actual object, and 2) batch fetch* is used on Post.comments to optimize collection fetching. This way you could just query for Posts and then access the ids of their comments with little performance hit. In most cases you shouldn't even need those lazy proxies unless your Comment is very fat. That kind of code would certainly look nicer without low-level row handling, and you could also use limit and offset in your queries. Just keep an eye on your query log to make sure everything works as intended.
*) Batch fetching isn't directly supported by JPA, but Hibernate supports it through mapping and Eclipselink through query hints.
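(On the SQL side, batch fetching simply collapses the one-query-per-post collection loads into something like the sketch below; table and column names assumed:)
-- one query per batch of posts instead of one query per post
select c.*
from comments c
where c.post_id in (?, ?, ?, ?) -- ids of the posts in the current batch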
Maybe some day Querydsl will support this kind of result-grouping post-processing out of the box...
I'm using Zend Framework for my website and I'd like to retrieve some data from my PostgreSQL database.
I have a query like:
SELECT DISTINCT ON(e.id) e.*, f.*, g.* FROM e, f, g
WHERE e.id = f.id_e AND f.id = g.id_f
This query works well, but I don't know how to express the DISTINCT ON (e.id) part with Zend.
It seems that I can get DISTINCT rows, but not DISTINCT ON specific columns.
$select->distinct()->from("e")->join("f", "e.id = f.id_e")
->join("g", "f.id = g.id_f");
Any idea how to make a select with a distinct column?
Thanks for the help
You probably can't do this with Zend Framework, since DISTINCT ON is not part of the SQL standard (see the end of the page in the Postgres documentation). Although Postgres supports it, I would assume it's not part of Zend Framework because you could in theory configure another database connection which does not support it.
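(If staying inside a portable abstraction matters, the same per-group pick can be expressed in standard SQL with a window function; a sketch, keeping the question's table names:)
SELECT * FROM (
    SELECT e.*, f.*, g.*,
           ROW_NUMBER() OVER (PARTITION BY e.id) AS rn
    FROM e
    JOIN f ON e.id = f.id_e
    JOIN g ON f.id = g.id_f
) sub
WHERE rn = 1;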
If you know in advance that you're developing for a specific database (Postgres in this case), you could use manually written statements instead. You'll gain more flexibility within the queries and better performance at the cost of no longer being able to switch databases.
You would then instantiate a Zend_Db_Adapter for Postgres. There are various methods available to get results for SQL queries, described in the framework's documentation starting at the section Reading Query Results. If you choose to go this route, I'd recommend creating your own subclass of the Zend_Db_Adapter_Pgsql class, so you can convert data types and throw exceptions in case of errors, instead of returning ambiguous null values and hiding error causes.