Query performance issues when using MATCH and SELECT together in OrientDB

I'm facing a problem with a query in OrientDB.
SELECT FROM (
MATCH
{class: article, as: article}.in('authorOf'){as: author}
RETURN article, author
) ORDER BY createdAt desc SKIP 0 LIMIT 50
As you can see, I want to fetch the 50 most recent articles along with their corresponding authors. The problem I'm facing is that the subquery first iterates over all of my articles, passes the full result down to the parent query, and only then is it sorted and limited. This is obviously not very efficient, because all the articles are loaded into memory when I only need 50 of them.
Does anyone know a better approach that doesn't require multiple queries?

You could try:
select #rid as article,in('authorOf')[0] as author from article order by createdAt desc SKIP 0 LIMIT 50
With this one I'm getting slightly better performance, but nothing dramatic.
EDIT following Luigi's comment
Create an index on the createdAt property:
CREATE INDEX article.createdAt ON article (createdAt) NOTUNIQUE
PS
I'm not sure that the ORDER BY in your query is working correctly.
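For what it's worth, here is a minimal sketch of running the suggested query from Python via the pyorient driver; the database name and credentials are placeholders, and @rid is used for the record id attribute.

import pyorient

# Placeholders: host, port, database name and credentials.
client = pyorient.OrientDB("localhost", 2424)
client.db_open("mydb", "admin", "admin")

# One row per article, sorted on the (now indexed) createdAt property.
records = client.command(
    "SELECT @rid AS article, in('authorOf')[0] AS author "
    "FROM article ORDER BY createdAt DESC SKIP 0 LIMIT 50"
)
for rec in records:
    print(rec.oRecordData)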

Related

Flask-sqlalchemy efficiently get count of query ordered by relationship

I have 2 models: Post and PostLike. I want to return a query of Posts ordered by their PostLike created_at field, as well as return the total count of posts in that query efficiently (without using .count()).
Simplified models:
class Post(db.Model):
    ...
    post_likes = db.relationship('PostLike', back_populates="post")

class PostLike(db.Model):
    ...
    created_at = db.Column(db.Float)
    post_id = db.Column(UUID(as_uuid=True), db.ForeignKey("post.id"), index=True)
    post = db.relationship("Post", back_populates="post_likes", foreign_keys=[post_id])
Here are the queries I'm trying to run:
# Get all posts
posts = Post.query.join(PostLike, Post.post_likes).order_by(PostLike.created_at.desc()).all()
# Get total # of posts
posts = Post.query.join(PostLike, Post.post_likes).order_by(PostLike.created_at.desc()).count()
There are 3 problems with those queries:
1. I'm not sure those queries are the best for my use case. Are they?
2. The count query returns the wrong number: it is higher than the number of results from the .all() query. Why?
3. Calling .count() directly is not performant. How do I implement an efficient query that also retrieves the count? Something like .statement.with_only_columns([func.count()])?
I'm using Postgres, and I expect up to millions of rows to count. How do I achieve this efficiently?
re: efficiency of .count()
The comments in other answers (linked in a comment to this question) appear to be outdated for current versions of MySQL and PostgreSQL. Testing against a remote table containing a million rows showed that whether I used
n = session.query(Thing).count()
which renders a SELECT COUNT(*) FROM (SELECT … FROM table_name), or I use
n = session.query(sa.func.count(sa.text("*"))).select_from(Thing).scalar()
which renders SELECT COUNT(*) FROM table_name, the row count was returned in the same amount of time (0.2 seconds in my case). This was tested using SQLAlchemy 1.4.0b2 against both MySQL version 8.0.21 and PostgreSQL version 12.3.
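For reference, a self-contained sketch of the two formulations compared above; the Thing model and the connection URL are placeholders.

import sqlalchemy as sa
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Thing(Base):
    __tablename__ = "thing"
    id = sa.Column(sa.Integer, primary_key=True)

# Placeholder connection URL.
engine = sa.create_engine("postgresql://user:pass@localhost/mydb")

with Session(engine) as session:
    # Renders SELECT count(*) FROM (SELECT ... FROM thing) AS anon_1
    n1 = session.query(Thing).count()
    # Renders SELECT count(*) FROM thing
    n2 = session.query(sa.func.count(sa.text("*"))).select_from(Thing).scalar()
    assert n1 == n2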

Couchbase N1QL Query getting distinct on the basis of particular fields

I have a document structure which looks something like this:
{
...
"groupedFieldKey": "groupedFieldVal",
"otherFieldKey": "otherFieldVal",
"filterFieldKey": "filterFieldVal"
...
}
I am trying to fetch all documents that are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. The otherFieldKey value varies slightly from one document to another, but I am fine with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected, but it takes a long time due to the GROUP BY step. Since the query is used to show products in the UI, that is not acceptable. I have tried applying indexes, but that has not produced fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields would be ideal.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the matching documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
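If you query from code, here is a rough sketch of running the grouped query against that covering index from Python; it assumes the Couchbase Python SDK 3.x (import paths differ between SDK versions), and the connection string, credentials and bucket name are placeholders.

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster, ClusterOptions

# Placeholder connection string and credentials.
cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("user", "password")),
)

n1ql = """
    SELECT groupedFieldKey, maxOtherFieldKey
    FROM bucket
    WHERE filterFieldKey = "filterFieldVal"
    GROUP BY groupedFieldKey
    LETTING maxOtherFieldKey = MAX(otherFieldKey)
"""

for row in cluster.query(n1ql):
    print(row)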

Why does MongoDB find have the same performance as count?

I am running tests against my MongoDB and for some reason find has the same performance as count.
Stats:
orders collection size: ~20M,
orders with product_id 6: ~5K
product_id is indexed for improved performance.
Query: db.orders.find({product_id: 6}) vs db.orders.find({product_id: 6}).count()
Result: the orders for the product vs. the number 5K, after 0.08 ms.
Why isn't count dramatically faster? It could find the positions of the first and last elements using the product_id index.
As the MongoDB documentation for count states, calling count is the same as calling find, except that instead of returning the documents it just counts them. To perform the count, it iterates over the cursor. It can't simply read the index and derive the number of documents from the first and last value of some ID, especially since you can have an index on a field other than the ID (and MongoDB IDs are not auto-incrementing). So find and count are essentially the same operation, but instead of returning the documents, count iterates over them, sums their number, and returns that to you.
Also, if you want a faster result, you could use estimatedDocumentCount (docs), which goes straight to the collection's metadata. The trade-off is that you lose the ability to ask "how many documents can I expect if I run this query?". If you need a faster count of the documents matching a query, you could use countDocuments (docs), which is a wrapper around an aggregate query. From my knowledge of Mongo, that looks like the fastest way to count query results without calling count, and it should be the preferred way, performance-wise, to count documents from now on (it was introduced in version 4.0.3).
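To make the distinction concrete, a rough PyMongo sketch (the connection string is a placeholder; the filter comes from the question):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.mydb.orders

# Wrapper around an aggregation ($match + $group): exact count for a filter.
exact = orders.count_documents({"product_id": 6})

# Reads only the collection metadata: fast, but cannot take a filter.
approx = orders.estimated_document_count()

print(exact, approx)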

Are Postgres WHERE clauses run sequentially?

I'm looking at using Postgres as a database to let our clients segment their customers.
The idea is that they can select a bunch of conditions in our front-end admin, and these conditions will get mapped to a SQL query. Right now, I'm thinking the best structure could be something like this:
SELECT DISTINCT id FROM users
WHERE id IN (
-- condition 1
)
AND id IN (
-- condition 2
)
AND id IN (
-- etc
)
Efficiency and query speed are super important to us, and I'm wondering if this is the best way of structuring things. When going through each of the WHERE clauses, will Postgres pass the id values from one to the next?
The ideal scenario would be, for a group of 1m users:
Query 1 filters down to 100k
Query 2 filters down from 100k to 10k
Query 3 filters down from 10k to 5k
As opposed to:
Query 1 filters from 1m to 100k
Query 2 filters down from 1m to 50k
Query 3 filters down from 1m to 80k
The intersection of all the queries is then mashed together, down to 5k
Maybe I'm misunderstanding something here, I'd love to get your thoughts!
Thanks!
Postgres uses a query planner to figure out how to most efficiently apply your query. It may reorder things or change how certain query operations (such as joins) are implemented, based on statistical information periodically collected in the background.
To determine how the query planner will structure a given query, stick EXPLAIN in front of it:
EXPLAIN SELECT DISTINCT id FROM users ...;
This will output the query plan for that query. Note that an empty table may get a totally different query plan from a table with (say) 10,000 rows, so be sure to test on real(istic) data.
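A minimal sketch of doing the same from Python with psycopg2; the DSN and the condition subqueries are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("""
        EXPLAIN
        SELECT DISTINCT id FROM users
        WHERE id IN (SELECT id FROM users WHERE /* condition 1 */ TRUE)
          AND id IN (SELECT id FROM users WHERE /* condition 2 */ TRUE)
    """)
    for (line,) in cur.fetchall():
        print(line)  # one line of the query plan per row
conn.close()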
Database engines are much more sophisticated than that.
The specific order of the conditions should not matter. The engine takes your query as a whole and tries to figure out the best way to get the data, according to all the conditions you specified, the indexes each table has, the number of records each condition will filter out, and so on.
If you want to get an idea of how your query will actually be executed, you can ask the engine to "explain" it for you: http://www.postgresql.org/docs/current/static/sql-explain.html
However, note that understanding what that explanation means requires quite a lot of background on how DB engines actually work.

MongoDB group by/distinct query

model checkin:
checkin
_id
interest_id
author_id
I've got a collection of checkins (resolved by simple "find" query)
I'd like to count the number of checkins for each interest.
What makes the task a bit more difficult: two checkins from the same person for the same interest should be counted as one checkin.
AFAIK, group operations in Mongo are performed with a map/reduce query. Should we use one here? The only idea I've got with that approach is to aggregate an array of users for each interest and then return that array's length.
EDIT I ended up not using map/reduce at all, although Emily's answer worked fine and quickly.
I only have to select checkins from the last 60 minutes, and there shouldn't be too many results. So I just fetch all of them through the Ruby driver and do all the calculations on the Ruby side. It's a bit slower, but much more scalable and easier to understand.
best,
Roman
Map/reduce would probably be the way to go for this, and you could get the desired results with two map/reduce passes.
In the first, you could remove duplicate author_id and interest_id pairs.
key would be author_id and interest_id
values would be checkin_id
The second map/reduce would then count the number of de-duplicated checkins for each interest.
key would be interest_id
value would be the checkin count
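These days the same two-step idea (de-duplicate the pairs, then count per interest) can also be expressed with the aggregation framework instead of map/reduce; a rough PyMongo sketch, with the field names taken from the question and a placeholder connection string:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").mydb

pipeline = [
    # Step 1: collapse duplicate (author_id, interest_id) pairs.
    {"$group": {"_id": {"interest": "$interest_id", "author": "$author_id"}}},
    # Step 2: count the de-duplicated checkins per interest.
    {"$group": {"_id": "$_id.interest", "checkins": {"$sum": 1}}},
]

for doc in db.checkin.aggregate(pipeline):
    print(doc["_id"], doc["checkins"])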