Flask-SQLAlchemy: efficiently get count of query ordered by relationship - PostgreSQL

I have two models: Post and PostLike. I want to return a query of Posts ordered by their PostLike created_at field, and also return the total count of posts in that query efficiently (without using .count()).
Simplified models:
class Post(db.Model):
    ...
    post_likes = db.relationship('PostLike', back_populates="post")

class PostLike(db.Model):
    ...
    created_at = db.Column(db.Float)
    post_id = db.Column(UUID(as_uuid=True), db.ForeignKey("post.id"), index=True)
    post = db.relationship("Post", back_populates="post_likes", foreign_keys=[post_id])
Here are the queries I'm trying to run:
# Get all posts
posts = Post.query.join(PostLike, Post.post_likes).order_by(PostLike.created_at.desc()).all()
# Get total # of posts
posts = Post.query.join(PostLike, Post.post_likes).order_by(PostLike.created_at.desc()).count()
There are three problems with those queries:
1. I'm not sure those queries are the best for my use case. Are they?
2. The count query returns the wrong number: it is higher than the number of results returned by the .all() query. Why?
3. Calling .count() directly is not performant. How do I implement an efficient query that also retrieves the count? Something like .statement.with_only_columns([func.count()])?
I'm using Postgres, and I'm expecting up to millions of rows to count. How do I achieve this efficiently?

re: efficiency of .count()
The comments in other answers (linked in a comment to this question) appear to be outdated for current versions of MySQL and PostgreSQL. Testing against a remote table containing a million rows showed that whether I used
n = session.query(Thing).count()
which renders a SELECT COUNT(*) FROM (SELECT … FROM table_name), or I use
n = session.query(sa.func.count(sa.text("*"))).select_from(Thing).scalar()
which renders SELECT COUNT(*) FROM table_name, the row count was returned in the same amount of time (0.2 seconds in my case). This was tested using SQLAlchemy 1.4.0b2 against both MySQL version 8.0.21 and PostgreSQL version 12.3.
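For getting the rows and the total in a single round trip, one option is a window function, since PostgreSQL supports COUNT(*) OVER (). Below is a minimal sketch, assuming the Post/PostLike models above; note that the join produces one row per like, which is also the likely reason .count() reports a higher number than .all() returns.
from sqlalchemy import func

# Sketch only: fetch the ordered posts and the total row count in one query
# via a COUNT(*) OVER () window function (supported by PostgreSQL).
query = (
    db.session.query(Post, func.count().over().label("total"))
    .join(PostLike, Post.post_likes)
    .order_by(PostLike.created_at.desc())
)

rows = query.all()                    # list of (Post, total) rows
posts = [post for post, _ in rows]
total = rows[0].total if rows else 0  # same value on every row
If each post should appear (and be counted) only once regardless of how many likes it has, the query would also need a DISTINCT or a per-post aggregate on created_at.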

Related

Couchbase N1QL Query getting distinct on the basis of particular fields

I have a document structure which looks something like this:
{
...
"groupedFieldKey": "groupedFieldVal",
"otherFieldKey": "otherFieldVal",
"filterFieldKey": "filterFieldVal"
...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. The otherFieldKey value has minor variations from one document to another, but I am comfortable with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected, but it takes a long time due to the GROUP BY step. Since this query is used to show products in the UI, that is not desirable. I have tried applying indexes, but they have not given fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields would be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
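If it helps to see the index and query together from application code, here is a rough sketch using the Couchbase Python SDK; the connection string and credentials are placeholders, and the import paths are assumptions that differ between SDK versions. The statement is the grouped query from the question, which the covering index above is intended to serve.
# Rough sketch; import paths and options vary between Couchbase Python SDK versions.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://localhost",  # assumed connection string
    ClusterOptions(PasswordAuthenticator("user", "password")),
)

# The grouped query from the question, rewritten with the aggregate in SELECT.
result = cluster.query(
    'SELECT groupedFieldKey, MAX(otherFieldKey) AS maxOtherFieldKey '
    'FROM bucket '
    'WHERE filterFieldKey = "filterFieldVal" '
    'GROUP BY groupedFieldKey'
)

for row in result:
    print(row)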

Query performance issues when using MATCH and SELECT together in OrientDB

I'm facing a problem with a query in OrientDB.
SELECT FROM (
MATCH
{class: article, as: article}.in('authorOf'){as: author}
RETURN article, author
) ORDER BY createdAt desc SKIP 0 LIMIT 50
As you can see, I want to fetch the 50 most recent articles with their corresponding authors. The problem I'm facing is that the subquery first iterates over all my articles and then passes them to the parent query, where they get filtered. This is obviously not very efficient, because all the articles are loaded into memory when I only need 50 of them.
Does anyone know a better approach without having to use multiple queries?
You could try:
select #rid as article, in('authorOf')[0] as author from article order by createdAt desc SKIP 0 LIMIT 50
With this one I'm getting slightly better performance, but nothing extreme.
EDIT following Luigi's comment
Create an index on the createdAt property:
CREATE INDEX article.createdAt ON article (createdAt) NOTUNIQUE
PS: I'm not sure that the ORDER BY in your query is working correctly.

Query one document per association from MongoDB

I'm investigating how MongoDB would work for us. One of our most-used queries gets the latest measurements (or the latest before a given time) for each station. There are thousands of stations, and each station has tens of thousands of measurements.
So we plan to have one collection for stations and another for measurements.
In SQL we would do the query with
SELECT * FROM measurements
INNER JOIN (
    SELECT station_id, max(meas_time) AS meas_time
    FROM measurements
    WHERE meas_time <= 'time_to_query'
    GROUP BY station_id
) t2 ON t2.station_id = measurements.station_id
    AND t2.meas_time = measurements.meas_time
This returns one measurement for each station, and the measurement is the newest one before the 'time_to_query'.
What query should be used in MongoDB to produce the same result? We are actually using Rails and Mongoid, but that should not matter.
update:
This question is not about how to perform a JOIN in MongoDB. The fact that in SQL getting the right data out of the table requires a join doesn't necessarily mean that in MongoDB we would also need a join. There is only one table used in the query.
We came up with this query
db.measurements.aggregate([{$group:{ _id:{'station_id':"$station_id"}, time:{$max:'$meas_time'}}}]);
with indexes
db.measurements.createIndex({ station_id: 1, meas_time: -1 });
Even though it seems to give the right data, it is really slow: it takes roughly a minute to get a bit over 3,000 documents from a collection of 65 million.
I just found that MongoDB is not using the index in this query, even though we are on version 3.2.
I guess a worst-case solution would be something like this (off the top of my head):
measures = []
Station.all.each do |station|
  measurement = Measurement.where(station_id: station.id, :meas_time.lte => 'time_to_query')
                           .order_by(meas_time: -1).limit(1).first
  measures << [station.name, measurement.measure, ....]
end
It depends on how much time the query is allowed to take. In any case, the data should be indexed by station_id and meas_time.
How much time does the SQL query take?
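For reference, here is a hedged pymongo sketch of the aggregation with the time filter included. The collection and field names come from the question; the database name and cutoff value are placeholders, and the $sort mirrors the { station_id: 1, meas_time: -1 } index.
# Sketch only: latest measurement per station at or before the cutoff time.
import datetime
from pymongo import MongoClient

client = MongoClient()    # assumes a local mongod
db = client["mydb"]       # hypothetical database name

# Placeholder cutoff; in practice this is the 'time_to_query' value from the question.
time_to_query = datetime.datetime.utcnow()

pipeline = [
    {"$match": {"meas_time": {"$lte": time_to_query}}},
    {"$sort": {"station_id": 1, "meas_time": -1}},
    # $first picks the newest document per station because of the sort above.
    {"$group": {"_id": "$station_id", "doc": {"$first": "$$ROOT"}}},
]

for row in db.measurements.aggregate(pipeline, allowDiskUse=True):
    print(row["doc"])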

Are Postgres WHERE clauses run sequentially?

I'm looking at using Postgres as a database to let our clients segment their customers.
The idea is that they can select a bunch of conditions in our front-end admin, and these conditions will get mapped to a SQL query. Right now, I'm thinking the best structure could be something like this:
SELECT DISTINCT id FROM users
WHERE id IN (
-- condition 1
)
AND id IN (
-- condition 2
)
AND id IN (
-- etc
)
Efficiency and query speed are super important to us, and I'm wondering if this is the best way of structuring things. When going through each of the WHERE clauses, will Postgres pass the id values from one to the next?
The ideal scenario would be, for a group of 1m users:
Query 1 filters down to 100k
Query 2 filters down from 100k to 10k
Query 3 filters down from 10k to 5k
As opposed to:
Query 1 filters from 1m to 100k
Query 2 filters down from 1m to 50k
Query 3 filters down from 1m to 80k
The intersection of all queries is then computed, down to 5k
Maybe I'm misunderstanding something here, I'd love to get your thoughts!
Thanks!
Postgres uses a query planner to figure out how to most efficiently apply your query. It may reorder things or change how certain query operations (such as joins) are implemented, based on statistical information periodically collected in the background.
To determine how the query planner will structure a given query, stick EXPLAIN in front of it:
EXPLAIN SELECT DISTINCT id FROM users ...;
This will output the query plan for that query. Note that an empty table may get a totally different query plan from a table with (say) 10,000 rows, so be sure to test on real(istic) data.
Database engines are much more sophisticated than that.
The specific order of the conditions should not matter. The engine takes your query as a whole and tries to figure out the best way to get the data, according to all the conditions you specified, the indexes each table has, the number of records each condition will filter out, etc.
If you want to get an idea of how your query will actually be executed, you can ask the engine to "explain" it for you: http://www.postgresql.org/docs/current/static/sql-explain.html
However, note that understanding what that explanation means requires quite a bit of technical background on how DB engines actually work.
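If you want to pull the plan from application code rather than psql, here is a minimal sketch with psycopg2; the connection string and the condition subqueries are placeholders.
# Minimal sketch: print the planner's output for the segmentation query.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string

query = """
EXPLAIN
SELECT DISTINCT id FROM users
WHERE id IN (SELECT id FROM users WHERE /* condition 1 */ TRUE)
  AND id IN (SELECT id FROM users WHERE /* condition 2 */ TRUE)
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for (line,) in cur.fetchall():  # EXPLAIN returns one text column per plan line
        print(line)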

SphinxQL: plain index and RT index?

How do I get records from more than one index using SphinxQL?
Here is the problem I'm facing:
All records except today's are kept in a plain index; today's records are maintained in an RT index.
When fetching records, we need to get them from the most recently changed index.
Using SphinxAPI, this returns records from the most recently changed index (the RT index). How do I do the same with SphinxQL?
SELECT * FROM index1, index2, index3 WHERE ...
SphinxQL is not like MySQL, where a comma means a join; in Sphinx it is closer to a UNION.
I think the best way to achieve this is to create a distributed index that consists of the indexes you want to use. For example:
index tehindex
{
    type = distributed
    local = disk_based_index_name_here
    local = rt_index_name_here
}
and then query Sphinx with SphinxQL like this:
select * from tehindex where match('test');
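Since SphinxQL speaks the MySQL protocol, the same statement can also be issued from application code. A rough sketch with pymysql, assuming searchd has a MySQL-protocol listener configured (commonly port 9306):
# Rough sketch: query the distributed index over the MySQL protocol.
import pymysql

# Sphinx ignores credentials; host and port are assumptions for a default setup.
conn = pymysql.connect(host="127.0.0.1", port=9306, user="", password="")

with conn.cursor() as cur:
    cur.execute("SELECT * FROM tehindex WHERE MATCH('test')")
    for row in cur.fetchall():
        print(row)

conn.close()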