Find last n entries ordered by association - postgresql

I have a Story model and a Post model. Each Story contains multiple Posts (Story.hasMany(models.Post); Post.belongsTo(models.Story);).
What I'm trying to achieve is to list the first 10 Stories ordered by Posts.createdAt. So it might be possible that the first entry is the oldest Story but with a very new Post.
What I'm trying right now is the following:
var options = {
    limit: 10,
    offset: 0,
    include: [{
        model: models.sequelize.model('Post'),
        attributes: ['id', 'createdAt'],
        required: true
    }],
    order: [
        [models.sequelize.model('Post'), 'createdAt', 'DESC'],
        ['createdAt', 'DESC']
    ],
    attributes: ['id', 'title', 'createdAt']
};
Story.findAll(options)...
Which gives me this SQL query:
SELECT "Story".*, "Posts"."id" AS "Posts.id", "Posts"."createdAt" AS "Posts.createdAt"
FROM (SELECT "Story"."id", "Story"."title", "Story"."createdAt"
FROM "Stories" AS "Story"
WHERE ( SELECT "StoryId"
FROM "Posts" AS "Post"
WHERE ("Post"."StoryId" = "Story"."id") LIMIT 1 ) IS NOT NULL
ORDER BY "Story"."createdAt" DESC LIMIT 10) AS "Story"
INNER JOIN "Posts" AS "Posts" ON "Story"."id" = "Posts"."StoryId"
ORDER BY "Posts"."createdAt" DESC, "Story"."createdAt" DESC;
The problem here is that if the 11th Story has a very new Post, it is not displayed in the top 10 list.
How can I get a limited list of stories ordered by Posts.createdAt?

Why are you using two nested subselects? That's inefficient, might produce expensive nested loops, and still does not return what you are looking for.
As a start, you can cross-join Stories and Posts and order by the creation timestamp from Posts (see the sketch below). But this still might scan the entire table.
Have a look at this presentation:
http://www.slideshare.net/MarkusWinand/p2d2-pagination-done-the-postgresql-way
But I have no idea how you can bring that into your model :-(
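Here is a minimal SQL sketch of that idea, assuming the "Stories"/"Posts" tables and columns from the generated query above (the lateral join and the last_post alias are mine, not Sequelize output): pair each Story with the createdAt of its newest Post, then order and limit on that.
SELECT "Story".*, last_post."createdAt" AS "lastPostAt"
FROM "Stories" AS "Story"
JOIN LATERAL (
    -- newest Post per Story; the inner join also filters out Stories without Posts
    SELECT "Post"."createdAt"
    FROM "Posts" AS "Post"
    WHERE "Post"."StoryId" = "Story"."id"
    ORDER BY "Post"."createdAt" DESC
    LIMIT 1
) AS last_post ON true
ORDER BY last_post."createdAt" DESC
LIMIT 10;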

Related

SQL query to filter where all array items in JSONB array meet condition

I made a similar post before, but deleted it as it had contextual errors.
One of the tables in my database includes a JSONB column which includes an array of JSON objects. It's not dissimilar to this example of a session table which I've mocked up below.
| id | user_id | snapshot | inserted_at |
| --- | --- | --- | --- |
| 1 | 37 | {cart: [{product_id: 1, price_in_cents: 3000, name: "product A"}, {product_id: 2, price_in_cents: 2500, name: "product B"}]} | 2022-01-01 20:00:00.000000 |
| 2 | 24 | {cart: [{product_id: 1, price_in_cents: 3000, name: "product A"}, {product_id: 3, price_in_cents: 5500, name: "product C"}]} | 2022-01-02 20:00:00.000000 |
| 3 | 88 | {cart: [{product_id: 4, price_in_cents: 1500, name: "product D"}, {product_id: 2, price_in_cents: 2500, name: "product B"}]} | 2022-01-03 20:00:00.000000 |
The query I've worked with to retrieve records from this table is as follows.
SELECT sessions.*
FROM sessions
INNER JOIN LATERAL (
    SELECT *
    FROM jsonb_to_recordset(sessions.snapshot->'cart')
        AS product(
            "product_id" integer,
            "name" varchar,
            "price_in_cents" integer
        )
) AS cart ON true;
I've been trying to update the query above to retrieve only the records in the sessions table for which ALL of the products in the cart have a price_in_cents value greater than 2000.
So far I've had no success forming this query, but I'd be grateful if anyone here can point me in the right direction.
You can use a JSON path expression:
select *
from sessions
...
where not sessions.snapshot @@ '$.cart[*].price_in_cents <= 2000'
There is no JSON path expression that would check that all array elements are greater than 2000. So this returns the rows where no element is less than or equal to 2000 - because that can be expressed with a JSON path expression.
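Applied to the mocked-up table above, this is how the predicate plays out (a quick sanity check; the id/user_id projection is just for brevity):
-- sessions 1 and 2 qualify (every cart item costs more than 2000 cents);
-- session 3 is excluded because "product D" costs 1500
SELECT id, user_id
FROM sessions
WHERE NOT snapshot @@ '$.cart[*].price_in_cents <= 2000';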
Here is one possible solution based on the idea of your original query.
Each element of the cart JSON array is joined to its parent sessions row. You're left adding the WHERE clause conditions now that the wanted JSON array elements are exposed; one way to finish the filter is sketched after the explanation below.
SELECT *
FROM (
    SELECT
        sess.id,
        sess.user_id,
        sess.inserted_at,
        cart_items.cart_name,
        cart_items.cart_product_id,
        cart_items.cart_price_in_cents
    FROM sessions sess,
        LATERAL (SELECT (snapshot -> 'cart') snapshot_cart
                 FROM sessions
                 WHERE id = sess.id) snap_arr,
        LATERAL (SELECT
                     (value::jsonb ->> 'name')::text cart_name,
                     (value::jsonb -> 'product_id')::int cart_product_id,
                     (value::jsonb -> 'price_in_cents')::int cart_price_in_cents
                 FROM JSONB_ARRAY_ELEMENTS(snap_arr.snapshot_cart)) cart_items
) session_snapshot_cart_product;
Explanation:
From the sessions table, the cart array is extracted and joined to its sessions row.
The items of the cart JSON array are then unnested by the second lateral join, using the JSONB_ARRAY_ELEMENTS(jsonb) function.
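For the original requirement (keep only the sessions where ALL cart items cost more than 2000), one hedged way to finish this, assuming only the session columns are needed back, is to aggregate per session and filter with bool_and:
-- a sketch, not the only option: bool_and() is true only when every
-- unnested cart item of the session satisfies the price condition
SELECT sess.id, sess.user_id, sess.inserted_at
FROM sessions sess
CROSS JOIN LATERAL JSONB_ARRAY_ELEMENTS(sess.snapshot -> 'cart') AS item
GROUP BY sess.id, sess.user_id, sess.inserted_at
HAVING bool_and((item ->> 'price_in_cents')::int > 2000);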
The following worked well for me and gave me the flexibility to use comparison operators other than just == or <=.
In one of the scenarios I needed to construct, the WHERE in the subquery also had to compare against an array of values using the IN comparison operator, which was not viable with some of the other solutions I looked at.
Leaving this here in case others run into the same issue as I did, or if others find better solutions or want to propose suggestions to build upon this one.
SELECT *
FROM sessions
WHERE NOT EXISTS (
    SELECT 1
    FROM jsonb_to_recordset(sessions.snapshot -> 'cart')
        AS product(
            "product_id" integer,
            "name" varchar,
            "price_in_cents" integer
        )
    WHERE name ILIKE 'Product%'
);

Very slow PSQL query with several JOINs

I've been having problems with a super slow query in PostgreSQL.
The relevant part of the DB ER diagram (image not reproduced here):
Table culture has 6 records, table microclimate_value has roughly 190k records, table location has 3 records, and table crop_yield has roughly 40k records.
Query:
SELECT max(cy.value) AS yield, EXTRACT(YEAR FROM cy.date) AS year
FROM microclimate_value AS mv
JOIN culture AS c ON mv.id_culture = c.id
JOIN location AS l ON mv.id_location = l.id
JOIN crop_yield AS cy ON l.id = cy.id_location
WHERE c.id = :cultureId AND l.id = :locationId
GROUP BY year
ORDER BY year
This query should return the max value (from the crop_yield table) for every year, for a given :cultureId (primary key of the culture table) and :locationId (primary key of the location table). The result looks something like this (yield is the value column from the crop_yield table):
[
  { "year": 2014, "yield": 0.0 },
  { "year": 2015, "yield": 1972.6590590838807 },
  { "year": 2016, "yield": 3254.6370785040726 },
  { "year": 2017, "yield": 2335.5804000689095 },
  { "year": 2018, "yield": 3345.2244602819046 },
  { "year": 2019, "yield": 3004.7096788680583 },
  { "year": 2020, "yield": 2920.8721807693764 },
  { "year": 2021, "yield": 0.0 }
]
Enhancement attempt:
Initially, this query took around 10 minutes, so there is some big problem with optimization or with the query itself. The first thing I did was to index the foreign keys in the microclimate_value and crop_yield tables, which resulted in far better performance, but the query still takes 2-3 minutes to execute.
Does anyone have any tips on how to improve this? I am open to anything, including changing the whole schema if needed, considering that I'm still learning SQL.
Thanks in advance!
Edit:
Added the EXPLAIN output, and a second EXPLAIN ANALYZE output captured after adding the indexes (screenshots not reproduced here).
Combine several columns in a single index. I would start with these, to get rid of all the filtering after searching for the data:
CREATE INDEX idx_crop_yield_id_location_year_value ON crop_yield (id_location, (EXTRACT(YEAR FROM date)), value);
CREATE INDEX idx_microclimate_value_id_location_id_culture ON microclimate_value (id_location, id_culture);
Maybe a different order in the columns works better, that's something you have to find out.
I would also leave the unused table "culture" out:
SELECT MAX(cy.value) AS yield,
       EXTRACT(YEAR FROM cy.date) AS year
FROM microclimate_value AS mv
JOIN location AS l ON mv.id_location = l.id
JOIN crop_yield AS cy ON l.id = cy.id_location
WHERE mv.id_culture = :cultureId
  AND l.id = :locationId
GROUP BY year
ORDER BY year;
And after every change in the query or the indexes, run EXPLAIN(ANALYZE, VERBOSE, BUFFERS) again.
Based on your explain analyze there are 10,970 rows of microclimate_value for location=2 and id_culture=1. Also there are 12,316 rows for location=2 in crop_yield.
As there is no other condition for the join of those 2 tables, the database has to create in memory a table with 10,970 * 12,316 = 135,106,520 rows and then group its results. It might take some time…
I think you are missing some condition in your query. Are you sure there should not be the same date on microclimate_value.date and crop_yield.date? Because, IMHO, without it, the query does not make much sense.
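If that hunch is right, a hedged sketch of the repaired query could look like the following (the join on the date columns is my assumption, based on the microclimate_value.date and crop_yield.date columns mentioned above):
-- assumes the dates of the two tables are meant to line up; the extra join
-- condition keeps the intermediate result proportional to the table sizes
-- instead of forming a per-location cross product
SELECT max(cy.value) AS yield, EXTRACT(YEAR FROM cy.date) AS year
FROM crop_yield AS cy
JOIN microclimate_value AS mv
  ON mv.id_location = cy.id_location
 AND mv.date = cy.date
WHERE cy.id_location = :locationId
  AND mv.id_culture = :cultureId
GROUP BY year
ORDER BY year;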
If there's no connection with those dates, then the only information that might be useful in microclimate_value is whether a row matching id_location=? and id_culture=? exists there:
select
    max(value) as max_value,
    extract(year from date) as year
from crop_yield
where id_location = ?
  and exists (
      select 1
      from microclimate_value
      where id_location = ? and id_culture = ?
  )
group by year
You'll either get results, if they match somewhere, or won't get any. The design of this schema seems questionable.

Finding top N entries per group in Arango

I'm trying to efficiently find the top entries by group in Arango (AQL). I have a fairly standard object collection and an edge collection representing Departments and Employees in that department.
Example purpose: Find the top 2 employees in each department by most years of experience.
Sample Data:
"departments" is an object collection. Here are some entries:
| _id | name |
| --- | --- |
| departments/1 | engineering |
| departments/2 | sales |
"dept_emp_edges" is an edge collection connecting departments and employee objects by ids.
| _id | _from | _to | years_exp |
| --- | --- | --- | --- |
| dept_emp_edges/1 | departments/1 | employees/1 | 3 |
| dept_emp_edges/2 | departments/1 | employees/2 | 4 |
| dept_emp_edges/3 | departments/1 | employees/3 | 5 |
| dept_emp_edges/4 | departments/2 | employees/1 | 6 |
I would like to end up with the top 2 employees per department by most years of experience:
| department | employee | years_exp |
| --- | --- | --- |
| departments/1 | employees/3 | 5 |
| departments/1 | employees/2 | 4 |
| departments/2 | employees/1 | 6 |
Long Working Query
The following query works! But it is a bit slow on larger tables and feels inefficient.
FOR dept IN departments
    LET top2earners = (
        FOR dep_emp_edge IN dept_emp_edges
            FILTER dep_emp_edge._from == dept._id
            SORT dep_emp_edge.years_exp DESC
            LIMIT 2
            RETURN {'department': dep_emp_edge._from,
                    'employee': dep_emp_edge._to,
                    'years_exp': dep_emp_edge.years_exp}
    )
    FOR row IN top2earners
        RETURN {'department': dep_emp_edge._from,
                'employee': dep_emp_edge._to,
                'years_exp': dep_emp_edge.years_exp}
I don't like this because there are 3 loops in here and it feels rather inefficient.
Short Query
However, I tried to write:
FOR dept IN departments
    FOR dep_emp_edge IN dept_emp_edges
        FILTER dep_emp_edge._from == dept._id
        SORT dep_emp_edge.years_exp DESC
        LIMIT 2
        RETURN {'department': dep_emp_edge._from,
                'employee': dep_emp_edge._to,
                'years_exp': dep_emp_edge.years_exp}
But this last query only outputs the final department top 2 results. Not all of the top 2 in each department.
My questions are: (1) why doesn't the second shorter query give all results? and (2) I'm quite new to Arango and ArangoQL, what other things can I do to make sure this is efficient?
Your first query is incorrect as written (Query: AQL: collection or view not found: dep_emp_edge (while parsing)) - as I can only guess what you meant, I'll ignore it for now.
Your smaller query limits the overall results to two - counterintuitively - because you are not grouping by department.
I suggest a slightly different approach: use the edge collection as the central source and group by _from, returning one document per department that contains an array of the (up to) two top employees, rather than one document per employee:
FOR edge IN dept_emp_edges
    SORT edge.years_exp DESC
    COLLECT dep = edge._from INTO deps
    LET emps = (
        FOR e IN deps
            LIMIT 2
            RETURN ZIP(["employee", "years_exp"], [e.edge._to, e.edge.years_exp])
    )
    RETURN {"department": dep, "employees": emps}
For your example database this returns:
[
  {
    "department": "departments/1",
    "employees": [
      { "employee": "employees/3", "years_exp": 5 },
      { "employee": "employees/2", "years_exp": 4 }
    ]
  },
  {
    "department": "departments/2",
    "employees": [
      { "employee": "employees/1", "years_exp": 6 }
    ]
  }
]
If the query is too slow, an index on the years_exp field of the dept_emp_edges collection could help (the Explain output suggests it would).

MongoDB php subquery

I'm new to NoSQL and currently trying to work with MongoDB.
From sql statement:
select id from table1 where id in (select related_id from table2 where column_name='somevalue')
What would be the equivalent MongoDB/PHP syntax for this query?
I have populated the 2 collections with sample data and have been trying to figure it out with aggregate, but with no results so far. There are plenty of samples around, but I couldn't find one for this type of sub-query.
Any help is appreciated.
Get the ids from table2 (distinct() returns a plain array of values, which is what $in expects):
var ids = db.table2.distinct("related_id", { "column_name": "somevalue" });
Then query the table1 collection with the ids from the previous query:
db.table1.find({ "id": { "$in": ids } }, { "id": 1, "_id": 0 });

Select most reviewed courses starting from courses having at least 2 reviews

I'm using Flask-SQLAlchemy with PostgreSQL. I have the following two models:
class Course(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    course_name = db.Column(db.String(120))
    course_description = db.Column(db.Text)
    course_reviews = db.relationship('Review', backref='course', lazy='dynamic')

class Review(db.Model):
    __table_args__ = (db.UniqueConstraint('course_id', 'user_id'), {})
    id = db.Column(db.Integer, primary_key=True)
    review_date = db.Column(db.DateTime)  # default=db.func.now()
    review_comment = db.Column(db.Text)
    rating = db.Column(db.SmallInteger)
    course_id = db.Column(db.Integer, db.ForeignKey('course.id'))
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'))
I want to select the courses that are most reviewed, starting with at least two reviews. The following SQLAlchemy query worked fine with SQLite:
most_rated_courses = db.session.query(
    models.Review, func.count(models.Review.course_id)
).group_by(models.Review.course_id).\
    having(func.count(models.Review.course_id) > 1).\
    order_by(func.count(models.Review.course_id).desc()).all()
But when I switched to PostgreSQL in production it gives me the following error:
ProgrammingError: (ProgrammingError) column "review.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT review.id AS review_id, review.review_date AS review_...
^
'SELECT review.id AS review_id, review.review_date AS review_review_date, review.review_comment AS review_review_comment, review.rating AS review_rating, review.course_id AS review_course_id, review.user_id AS review_user_id, count(review.course_id) AS count_1 \nFROM review GROUP BY review.course_id \nHAVING count(review.course_id) > %(count_2)s ORDER BY count(review.course_id) DESC' {'count_2': 1}
I tried to fix the query by adding models.Review to the GROUP BY clause, but it did not work:
most_rated_courses = db.session.query(
    models.Review, func.count(models.Review.course_id)
).group_by(models.Review.course_id).\
    having(func.count(models.Review.course_id) > 1).\
    order_by(func.count(models.Review.course_id).desc()).all()
Can anyone please help me with this issue? Thanks a lot.
SQLite and MySQL both allow a query that has aggregates (like count()) without applying GROUP BY to all other columns - which in terms of standard SQL is invalid, because if more than one row is present in an aggregated group, the database has to pick one of those rows essentially at random for the non-aggregated columns.
So your query for Review basically returns the first "Review" row for each distinct course id - for course id 3, if you had seven "Review" rows, it just chooses an essentially random "Review" row within the group of "course_id=3" (an illustration follows below). I gather the answer you really want, "Course", is available here because you can take that semi-randomly selected Review object and just call ".course" on it, giving you the correct Course, but this is a backwards way to go.
But once you are on a proper database like PostgreSQL, you need to use correct SQL. The data you need from the "review" table is just the course_id and the count, nothing else, so query just for that (assume for the moment that we don't need to display the counts; that comes in a minute):
most_rated_course_ids = session.query(
    Review.course_id,
).\
    group_by(Review.course_id).\
    having(func.count(Review.course_id) > 1).\
    order_by(func.count(Review.course_id).desc()).\
    all()
but that's not your Course object - you want to take that list of ids and apply it to the course table. We first need to keep our list of course ids as a SQL construct instead of loading the data - that is, turn it into a derived table by converting the query into a subquery (change .all() to .subquery()):
most_rated_course_id_subquery = session.query(
    Review.course_id,
).\
    group_by(Review.course_id).\
    having(func.count(Review.course_id) > 1).\
    order_by(func.count(Review.course_id).desc()).\
    subquery()
one simple way to link that to Course is to use an IN:
courses = session.query(Course).filter(
    Course.id.in_(most_rated_course_id_subquery)).all()
but that's essentially going to throw away the "ORDER BY" you're looking for and also doesn't give us any nice way of actually reporting on those counts along with the course results. We need to have that count along with our Course so that we can report it and also order by it. For this we use a JOIN from the "course" table to our derived table. SQLAlchemy is smart enough to know to join on the "course_id" foreign key if we just call join():
courses = session.query(Course).join(most_rated_course_id_subquery).all()
then to get at the count, we need to add that to the columns returned by our subquery along with a label so we can refer to it:
most_rated_course_id_subquery = session.query(
    Review.course_id,
    func.count(Review.course_id).label("count")
).\
    group_by(Review.course_id).\
    having(func.count(Review.course_id) > 1).\
    subquery()

courses = session.query(
    Course, most_rated_course_id_subquery.c.count
).join(
    most_rated_course_id_subquery
).order_by(
    most_rated_course_id_subquery.c.count.desc()
).all()
A great article I like to point people to about GROUP BY and this kind of query is SQL GROUP BY techniques, which describes the common need for the "select from A join to (subquery of B with aggregate/GROUP BY)" pattern.
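In plain SQL, the pattern that article describes (and roughly what the final SQLAlchemy query above emits, assuming the default course/review table names from the models) looks like this:
-- derived table with the aggregate, joined back to the parent table
SELECT course.*, most_rated.count
FROM course
JOIN (
    SELECT course_id, count(*) AS count
    FROM review
    GROUP BY course_id
    HAVING count(*) > 1
) AS most_rated ON most_rated.course_id = course.id
ORDER BY most_rated.count DESC;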