Efficient (graph) aggregations in OrientDB - orientdb

Given a graph with interconnected entities:
What is the most efficient way to aggregate vertices based on their edges? For instance, with the given graph, return Musicians with an aggregated band count.
My current approach aggregates after the selection:
select m, count(b) as cnt from (match {class:Musician, as: m}<-currentMember-{as:b} return m, b) group by m order by cnt desc limit 10
But this looks highly inefficient.

Try this:
select name, in('currentMember').size() as band from Musician order by band desc
Hope it helps.
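If only the top few musicians matter, the projection above can also carry the ordering and limit from the original query. A sketch in OrientDB SQL (band_count is an invented alias; the edge class is currentMember as above):

```sql
-- Count incoming currentMember edges per Musician and keep the top 10
SELECT name, in('currentMember').size() AS band_count
FROM Musician
ORDER BY band_count DESC
LIMIT 10
```

This avoids materializing every (musician, band) pair before grouping, which is what made the MATCH-then-GROUP-BY version look expensive.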

Related

SQLite: Using MAX of GROUP BY averages to compute percentage of all the other averages relative to that MAX

I have four groups of people, and they each have an average for a given metric. The following query would yield four values, one for each group.
SELECT group, AVG(metric) AS 'avg_metric'
FROM table
GROUP BY group
Now one of those averages will be the max. I want to also capture the avg_metric / MAX(avg_metric) in my SELECT statement. Since I can't use MAX and AVG(MAX...) in the same query, I thought something like this would work:
SELECT group, (100 * sub.avg_metric / MAX(sub.avg_metric)) AS 'percentage'
FROM table
JOIN (SELECT AVG(metric) AS 'avg_metric'
FROM table
GROUP BY group
ON table.group = sub.group) AS sub
Unfortunately, I can't seem to get the syntax right. (I'm using SQLite.) Additionally, it would be much nicer to have
GROUP AVG_METRIC PERCENTAGE
group 1 57 45
group 2 ....
In that order, but I don't see how to do that either.
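For what it's worth, one formulation that parses in SQLite computes the per-group averages once in a CTE and divides by their MAX via a scalar subquery. A minimal runnable sketch with Python's sqlite3 and invented data (numbers chosen so group 1 comes out at exactly 50%):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t ("group" TEXT, metric REAL);
INSERT INTO t VALUES
  ('group 1', 50), ('group 1', 64),    -- avg 57
  ('group 2', 100), ('group 2', 128);  -- avg 114 (the max)
""")

# Compute each group's average once, then express it as a
# percentage of the largest average via a scalar subquery.
rows = conn.execute("""
WITH sub AS (
  SELECT "group", AVG(metric) AS avg_metric
  FROM t
  GROUP BY "group"
)
SELECT "group",
       avg_metric,
       100.0 * avg_metric / (SELECT MAX(avg_metric) FROM sub) AS percentage
FROM sub
ORDER BY "group"
""").fetchall()

for row in rows:
    print(row)
```

Note that `group` is a reserved word, hence the double quotes; renaming the column avoids that entirely.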

How to get SUM and AVG from a column in PostgreSQL

Maybe I'm overlooking something, but none of the answers I found solve my problem. I'm trying to get the sum and average from a column, but everything I see is getting sum and average from a row.
Here is the query I'm using:
SELECT product_name,unit_cost,units,total,SUM(total),AVG(total)
FROM products
GROUP BY product_name,unit_cost,total
And this is what I get: it returns the exact same amounts. What I need is to add all the values in the unit_cost column and return the SUM and AVG of all its values. What am I missing? What did I not understand? Thank you for taking the time to answer!
Using AVG and SUM as window functions, with no grouping, will do the job.
select product_name,unit_cost,units,total,
SUM(total) over all_rows as sum_of_all_rows,
AVG(total) over all_rows as avg_of_all_rows
from products
window all_rows as ();
The groups in your query contain just one row each if total is distinct per row, which seems to be the case in your example. You can check this with a COUNT aggregate (the value will be 1).
Removing total (and probably unit_cost) from your SELECT and GROUP BY clauses should help.
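The same window form runs on any engine with SQL window functions (SQLite ≥ 3.25 included). A minimal runnable sketch with Python's sqlite3 and invented sample rows, using an inline OVER () instead of the named window:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_name TEXT, unit_cost REAL, units INTEGER, total REAL);
INSERT INTO products VALUES
  ('widget', 2.0, 10, 20.0),
  ('gadget', 5.0,  4, 20.0),
  ('gizmo',  3.0, 20, 60.0);
""")

# OVER () with an empty frame makes SUM/AVG range over ALL rows,
# while each input row is still returned individually.
rows = conn.execute("""
SELECT product_name, unit_cost, units, total,
       SUM(total) OVER () AS sum_of_all_rows,
       AVG(total) OVER () AS avg_of_all_rows
FROM products
""").fetchall()

for row in rows:
    print(row)
```

Every row carries the same grand total (100.0) and grand average, which is exactly the "same amounts repeated" effect the asker saw, except here it is the whole-table aggregate rather than a one-row group.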

How to get information on aggregate object with Postgres

I have a query that uses GROUP BY to get the MIN distance between two objects.
WITH point_interet AS (
SELECT pai.ogc_fid as p1id, pai2.ogc_fid as p2id, MIN(ST_Distance(pai.geom, pai2.geom)) AS distance
FROM point_activite_interet pai
JOIN point_activite_interet pai2 ON pai.ogc_fid > pai2.ogc_fid
GROUP BY pai.ogc_fid)
SELECT * FROM point_interet
ORDER BY distance DESC;
This doesn't work because Postgres says p2id should be in the GROUP BY clause. But that's not what I want: I would like to know which object is the closest to each pai.ogc_fid.
Do you have any idea how I should do that?
I think you just want to remove the min() aggregate (and the GROUP BY), and instead pick, for each point, the row with the smallest distance.
Alternatively you could add a window specification, such as min(...) OVER (ROWS UNBOUNDED PRECEDING), but I don't think this is what you want, because it would give you the minimum distance over any preceding row combination.
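One common way to keep only the closest partner per point is a ROW_NUMBER() window ordered by distance. A runnable sqlite3 sketch, with plain 1-D coordinates standing in for ST_Distance (all table and column names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pts (id INTEGER, x REAL);
INSERT INTO pts VALUES (1, 0.0), (2, 1.0), (3, 5.0);
""")

# Rank every partner of each point by distance; rn = 1 is the closest.
rows = conn.execute("""
WITH pairs AS (
  SELECT a.id AS p1id, b.id AS p2id,
         ABS(a.x - b.x) AS distance,
         ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY ABS(a.x - b.x)) AS rn
  FROM pts a
  JOIN pts b ON a.id <> b.id
)
SELECT p1id, p2id, distance
FROM pairs
WHERE rn = 1
ORDER BY p1id
""").fetchall()

print(rows)
```

In Postgres the same shape works with ST_Distance(pai.geom, pai2.geom) in place of ABS(a.x - b.x), or more idiomatically with DISTINCT ON (pai.ogc_fid) ... ORDER BY pai.ogc_fid, distance.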

Greatest n per group with multiple criteria for greatest

I need to select the largest, most recent, or currently active term across a number of schools, with the assumption that it is possible for a school to have multiple concurrent terms (i.e., one term that honors students are registered in, and another for non-honors students). I also need to take the end date into account, as the honors term may have the same start date but run for a whole year instead of just a semester, and I want the semester.
Code looks something like this:
SELECT t.school_id, t.term_id, COUNT(s.id) AS size, t.start_date, t.end_date
FROM term t
INNER JOIN students s ON t.term_id = s.term_id
WHERE t.school_id = (some school id)
GROUP BY t.school_id, t.term_id
ORDER BY t.start_date DESC, t.end_date ASC, size DESC LIMIT 1;
This works perfectly to find the largest currently or most recently active term, but I want to be able to eliminate the WHERE t.school_id = (some school id) part.
A standard greatest n per group can easily choose the largest OR most recent term, but I need to select the most recent term that ends soonest with the largest number of students.
Not sure I am interpreting your question correctly. Would be easier if you had supplied table definitions including primary and foreign keys.
If you want the most recent term that ends soonest with the largest number of students per school, this might do it:
SELECT DISTINCT ON (t.school_id)
       t.school_id, t.term_id, s.size, t.start_date, t.end_date
FROM   term t
JOIN  (
   SELECT term_id, COUNT(id) AS size
   FROM   students
   GROUP  BY term_id
   ) s USING (term_id)
ORDER  BY t.school_id, t.start_date DESC, t.end_date, size DESC;
More explanation for DISTINCT ON in this related answer:
Select first row in each GROUP BY group?
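In databases without DISTINCT ON, an equivalent greatest-n-per-group query can rank the terms per school with ROW_NUMBER() over the same sort keys. A runnable sqlite3 sketch with invented sample data (two concurrent terms for school 1, where the semester term should win):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (school_id INTEGER, term_id INTEGER, start_date TEXT, end_date TEXT);
CREATE TABLE students (id INTEGER, term_id INTEGER);
INSERT INTO term VALUES
  (1, 10, '2023-09-01', '2024-06-30'),  -- year-long honors term
  (1, 11, '2023-09-01', '2024-01-31'),  -- semester term (ends soonest)
  (2, 20, '2023-09-01', '2024-01-31');
INSERT INTO students VALUES (1, 10), (2, 11), (3, 11), (4, 20);
""")

# ROW_NUMBER() per school, ordered by the three tie-break criteria;
# rn = 1 is the most recent term that ends soonest with the most students.
rows = conn.execute("""
WITH sized AS (
  SELECT t.school_id, t.term_id, s.size, t.start_date, t.end_date,
         ROW_NUMBER() OVER (
           PARTITION BY t.school_id
           ORDER BY t.start_date DESC, t.end_date ASC, s.size DESC
         ) AS rn
  FROM term t
  JOIN (SELECT term_id, COUNT(id) AS size FROM students GROUP BY term_id) s
       USING (term_id)
)
SELECT school_id, term_id, size FROM sized WHERE rn = 1 ORDER BY school_id
""").fetchall()

print(rows)
```

For school 1 both terms share a start date, so the earlier end date breaks the tie and term 11 is chosen, mirroring what DISTINCT ON does in Postgres.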

Pagination on large data sets? – Abort count(*) after a certain time

We use the following pagination technique here:
get count(*) of given filter
get first 25 records of given filter
-> render some pagination links on the page
This works pretty well as long as count(*) is reasonably fast. In our case the data size has grown to a point where a non-indexed query (although most things are covered by indexes) takes more than a minute. So at this point the user waits for a mostly unimportant number (total records matching the filter, number of pages). The first N records are often ready pretty fast.
Therefore I have two questions:
can I limit the count(*) to a certain number
or would it be possible to limit it by time? (no count() known after 20ms)
Or just in general: are there some easy ways to avoid that problem? We would like to keep the system as untouched as possible.
Database: Oracle 10g
Update
There are several scenarios
a) there's an index -> neither count(*) nor the actual select should be a problem
b) there's no index
count(*) is HUGE, and it takes ages to determine it -> rownum would help
count(*) is zero or very low; here a time limit would help. Or I could just skip the count(*) if the result set is already below the page limit.
You could use WHERE ROWNUM <= x to limit the number of rows counted. And if you need to show the user that there are more records, count up to x + 1 rows, just to see whether there are more than x.
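The same capped-count trick works in most engines with LIMIT instead of ROWNUM: count a subquery that stops after x + 1 rows, so the scan ends as soon as the cap is reached. A runnable sketch with Python's sqlite3 (table name, data, and PAGE_CAP are invented for illustration):

```python
import sqlite3

PAGE_CAP = 25  # count at most PAGE_CAP + 1 matching rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, flag INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, 1)", [(i,) for i in range(100)])

# The inner LIMIT stops the scan once PAGE_CAP + 1 matches are found,
# so the count is bounded no matter how many rows actually match.
(capped,) = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM records WHERE flag = 1 LIMIT ?)",
    (PAGE_CAP + 1,),
).fetchone()

has_more = capped > PAGE_CAP
print(capped, has_more)
```

The UI can then render "25+" (or enable a "next page" link) whenever has_more is true, without ever paying for the full count.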