Finding top N entries per group in Arango

I'm trying to efficiently find the top entries by group in Arango (AQL). I have a fairly standard object collection and an edge collection representing Departments and Employees in that department.
Example purpose: Find the top 2 employees in each department by most years of experience.
Sample Data:
"departments" is a document collection. Here are some entries:
_id            name
departments/1  engineering
departments/2  sales
"dept_emp_edges" is an edge collection connecting department and employee documents by their ids.
_id               _from          _to          years_exp
dept_emp_edges/1  departments/1  employees/1  3
dept_emp_edges/2  departments/1  employees/2  4
dept_emp_edges/3  departments/1  employees/3  5
dept_emp_edges/4  departments/2  employees/1  6
I would like to end up with the top 2 employees per department by most years of experience:
department     employee     years_exp
departments/1  employees/3  5
departments/1  employees/2  4
departments/2  employees/1  6
Long Working Query
The following query works! But is a bit slow on larger tables and feels inefficient.
FOR dept IN departments
    LET top2earners = (
        FOR dep_emp_edge IN dept_emp_edges
            FILTER dep_emp_edge._from == dept._id
            SORT dep_emp_edge.years_exp DESC
            LIMIT 2
            RETURN {'department': dep_emp_edge._from,
                    'employee': dep_emp_edge._to,
                    'years_exp': dep_emp_edge.years_exp}
    )
    FOR row IN top2earners
        RETURN {'department': dep_emp_edge._from,
                'employee': dep_emp_edge._to,
                'years_exp': dep_emp_edge.years_exp}
I don't like this because there are three loops in here and it feels rather inefficient.
Short Query
However, I tried to write:
FOR dept IN departments
    FOR dep_emp_edge IN dept_emp_edges
        FILTER dep_emp_edge._from == dept._id
        SORT dep_emp_edge.years_exp DESC
        LIMIT 2
        RETURN {'department': dep_emp_edge._from,
                'employee': dep_emp_edge._to,
                'years_exp': dep_emp_edge.years_exp}
But this last query only outputs the final department top 2 results. Not all of the top 2 in each department.
My questions are: (1) why doesn't the second shorter query give all results? and (2) I'm quite new to Arango and ArangoQL, what other things can I do to make sure this is efficient?

Your first query is incorrect as written (Query: AQL: collection or view not found: dep_emp_edge (while parsing)). Since I can only guess what you meant, I will ignore it for now.
Your shorter query limits the overall result to two rows, counterintuitively, because you are not grouping by department: SORT and LIMIT apply to the combined output of both FOR loops, not to each department separately.
I suggest a slightly different approach: use the edge collection as the central source and group by _from, returning one document per department that contains an array of its top two employees (should they exist), rather than one document per employee:
FOR edge IN dept_emp_edges
    SORT edge.years_exp DESC
    COLLECT dep = edge._from INTO deps
    LET emps = (
        FOR e IN deps
            LIMIT 2
            RETURN ZIP(["employee", "years_exp"], [e.edge._to, e.edge.years_exp])
    )
    RETURN {"department": dep, "employees": emps}
For your example database this returns:
[
  {
    "department": "departments/1",
    "employees": [
      {
        "employee": "employees/3",
        "years_exp": 5
      },
      {
        "employee": "employees/2",
        "years_exp": 4
      }
    ]
  },
  {
    "department": "departments/2",
    "employees": [
      {
        "employee": "employees/1",
        "years_exp": 6
      }
    ]
  }
]
If the query is too slow, an index on the years_exp field of the dept_emp_edges collection could help (Explain suggests it would).
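To make the difference between a global limit and a per-group limit concrete, here is a minimal Python sketch over the sample edges from the question. It is not AQL and says nothing about ArangoDB's execution plan; the function names are made up for illustration.

```python
from collections import defaultdict
from heapq import nlargest

# Sample edges copied from the question.
edges = [
    {"_from": "departments/1", "_to": "employees/1", "years_exp": 3},
    {"_from": "departments/1", "_to": "employees/2", "years_exp": 4},
    {"_from": "departments/1", "_to": "employees/3", "years_exp": 5},
    {"_from": "departments/2", "_to": "employees/1", "years_exp": 6},
]


def global_top_n(edges, n=2):
    """Sort everything, then cut once: what a single SORT + LIMIT amounts to."""
    return sorted(edges, key=lambda e: e["years_exp"], reverse=True)[:n]


def top_n_per_group(edges, n=2):
    """Group by department first, then keep the n largest inside each group."""
    groups = defaultdict(list)
    for edge in edges:
        groups[edge["_from"]].append(edge)
    return [
        {
            "department": dept,
            "employees": [
                {"employee": e["_to"], "years_exp": e["years_exp"]}
                for e in nlargest(n, groups[dept], key=lambda e: e["years_exp"])
            ],
        }
        for dept in sorted(groups)
    ]
```

With this data, global_top_n returns only two rows in total, while top_n_per_group returns up to two rows for every department, matching the shape of the grouped result above.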

Related

how to determine the optimal query and limit size

I am running a mongodb aggregate query to group the data of a collection and get a sum of the values in a field and insert it to another collection.
ex: collection1: [
  { name: "foo",
    group_id: 1,
    marks: 10
  },
  { name: "bar",
    group_id: 1,
    marks: 20
  },
  { name: "Hello World",
    group_id: 2,
    marks: 40
  }
]
So the group-by query will insert the following data into another collection, e.g. collection2:
collection2: [
  {
    group_id: 1,
    marks: 30
  },
  {
    group_id: 2,
    marks: 40
  }
]
I need to do these two operations:
Group the data and get the aggregate
Create a new collection with the data
Now comes the interesting part: the data being grouped has 5 billion rows, so the query to get the aggregate of the marks will be very slow to execute.
Thus, writing a Node script that gets the data by group and then inserts it into another collection will not be very optimal. The other way I was thinking of was to limit the data to x rows, e.g. 1000, group those 1000 and insert the result into collection2, then update collection2 for the next 1000, and so on.
So, here are my questions. Is aggregating the data in limited batches and then iterating over them faster?
ex:
step 1: group and get the sum of the marks of 1000 rows
step 2: insert/update collection2 with this data
step 3: goto step1
Is the above method more useful than getting the aggregate by grouping all 5 billion records at once and then inserting it into collection2? Assuming there is a Node API doing the above task, how do I determine or calculate the limit size for the fastest operation? And how do I use whenMatched to update or insert the marks into collection2?
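The three steps can be sketched in plain Python to show that per-batch partial sums merge into the same totals as one full pass. This is only an illustration of the arithmetic: it does not talk to MongoDB, makes no claim about which approach is faster, and the helper names and the in-memory dict standing in for collection2 are assumptions.

```python
from itertools import islice

# Sample rows copied from collection1 in the question.
rows = [
    {"name": "foo", "group_id": 1, "marks": 10},
    {"name": "bar", "group_id": 1, "marks": 20},
    {"name": "Hello World", "group_id": 2, "marks": 40},
]


def batches(iterable, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def aggregate_in_batches(rows, size):
    """Step 1: sum one batch. Step 2: merge it into the totals. Step 3: repeat."""
    totals = {}  # stands in for collection2
    for chunk in batches(rows, size):
        partial = {}
        for row in chunk:
            partial[row["group_id"]] = partial.get(row["group_id"], 0) + row["marks"]
        for group_id, marks in partial.items():
            # Add to the stored sum if the group exists, otherwise insert it.
            totals[group_id] = totals.get(group_id, 0) + marks
    return totals
```

Because addition is associative, the batch size changes only how much work each round does, not the final sums.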

How do I get set of records based on associated collection limited to populate exclusive criteria?

Using Waterline (and Sails.js) how do I query for results of a many to many association limited exclusively to matching the entire criteria set?
Room.find({ type: "1to1" })
  .populate('participants', { id: [1, 2] })
  .then((rooms) => {
    // rooms array of records should only contain room records
    // that have both participants with an id of 1 and 2,
    // but not rooms with only one participant 1 or 2
  })
I am looking for the result to only contain rooms that have both specified users.

Optimising queries in mongodb

I am working on optimising my queries in mongodb.
In a normal SQL query there is an order in which the WHERE clauses are applied. For example, in select * from employees where department="dept1" and floor=2 and sex="male", department="dept1" is applied first, then floor=2, and lastly sex="male".
I was wondering whether it happens in a similar way in MongoDB.
E.g.
DBObject search = new BasicDBObject("department", "dept1").append("floor", 2).append("sex", "male");
Which match clause will be applied first here, or does Mongo not work in this manner at all?
This question basically arises from my background with SQL databases.
Please help.

Without an index, MongoDB has to scan the full collection (a collection scan) to find the required documents. In your case, if you want the fields applied in the order department, floor, sex, you should create this compound index:
db.employees.createIndex( { "department": 1, "floor": 1, "sex" : 1 } )
The documentation (https://docs.mongodb.org/manual/core/index-compound/) gives this example:
db.products.createIndex( { "item": 1, "stock": 1 } )
and explains:
"The order of the fields in a compound index is very important. In the previous example, the index will contain references to documents sorted first by the values of the item field and, within each value of the item field, sorted by values of the stock field."
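The ordering described in that quote can be mimicked with sorted tuples. The sketch below is a plain-Python analogy, not how MongoDB stores its indexes; the product data and the find_by_item helper are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical documents for a "products" collection.
products = [
    {"item": "pen", "stock": 12},
    {"item": "ink", "stock": 3},
    {"item": "pen", "stock": 4},
    {"item": "cap", "stock": 7},
]

# Entries are ordered by item first, then by stock within each item,
# and carry the position of the document they refer to.
index = sorted((p["item"], p["stock"], pos) for pos, p in enumerate(products))


def find_by_item(item):
    """Use the sorted order to jump straight to the run of entries for `item`."""
    lo = bisect_left(index, (item,))                # first entry with this item
    hi = bisect_left(index, (item, float("inf")))   # just past the last one
    return [products[pos] for _, _, pos in index[lo:hi]]
```

A lookup on the leading field (item) finds a contiguous run in the sorted entries, whereas a lookup on stock alone would have to look at every entry, which is why the field order matters.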

Query Exact Matches in MongoDB

I am working on developing a web application feature that suggests prices for users based on previous orders in the database. I am using the MongoDB NoSQL database. Before I begin, I am trying to figure out the best way to set up the order object to return the correct results.
When a user places an order such as the following: 1 cheeseburger + 1 fry, McDonalds, 12345 E. Street, MyTown, USA... it should only return objects that are EXACT matches from the database.
For example, I would not want to receive an order that contained 1 cheeseburger + 1 fry + 1 shake. I will be keeping running averages of the prices and counts for that exact order.
{
  restaurantAddress: "12345 E. Street, MyTown, USA",
  restaurantName: "McDonald's",
  orders: {
    { cheeseburger: 1, fries: 2 }
    : {
      sumPaid: 1444.55,
      numTimesOrdered: 167,
      avgPaid: 8.65 (gets recomputed w/ each new order)
    },
    { // repeat for each unique item config },
    { // another unique item (or items) }
  }
}
Do you think this is a valid and efficient way to set up the document in MongoDB? Or should I be using multiple documents?
If this is valid, how can I query it to only return exact orders? I looked into $eq but it did not seem to be exactly what I was looking for.

So I believe we have solved the problem. The solution is to create a string that is unique for the order on the server side. For example, we will write a function that would transform the 1 cheeseburger + 2 fries into burger1fries2. In order to keep consistency in the database, we will first sort the entries alphabetically, so we will always hit what we intended with the query. A similar order of 2 fries + 1 cheeseburger would generate the string burger1fries2 as well.
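A minimal sketch of such a key function in Python follows. The order_key name and the dict-of-quantities input are assumptions, and it uses the full item names, so it produces cheeseburger1fries2 rather than the abbreviated burger1fries2 mentioned above.

```python
def order_key(items):
    """Build a canonical string for an order, independent of item order.

    `items` maps item name to quantity, e.g. {"fries": 2, "cheeseburger": 1}.
    Sorting by name first means the same set of items always yields the same key.
    """
    return "".join(f"{name}{quantity}" for name, quantity in sorted(items.items()))
```

For example, order_key({"fries": 2, "cheeseburger": 1}) and order_key({"cheeseburger": 1, "fries": 2}) both give "cheeseburger1fries2". One caveat with concatenating without a separator: if item names can end in digits, two different orders could collide, so a delimiter between entries may be safer.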

Calculate value based on existence of records matching given criteria - FileMaker Pro 13

How can I write a calculation field in a table that outputs '1' if there are other (related) records in the same table that meet a given set of criteria and '0' otherwise?
Here's my problem explained in more detail:
I have a table containing 'students' and another containing 'exam results'. The 'exam results' table looks like this:
StudentID SubjectID Level Result
3234 1 2 A-
3234 2 4 B+
4739 1 4 C+
A student can only pass a Level 4 exam in subject 2 if they have also passed a Level 2 exam in subject 1 with a B+ or higher. I want to define a field in the 'students' table that contains a '1' if there exists an exam result belonging to the right student that meets these criteria and a '0' otherwise.
What would be the best way to do this?

Let us take an example of a Results table where the results are also calculated as a numeric value, e.g.
StudentID  SubjectID  Level  Result  cResultNum
3234       1          2      A-      95
3234       2          4      B+      85
4739       1          4      C+      75
and an Exams table with the following fields (among others):
RequiredSubjectID
RequiredLevel
RequiredResultNum
Given these, you can construct a relationship between Exams and (another occurrence of) Results as:
Exams::RequiredSubjectID = Results 2::SubjectID
AND
Exams::RequiredLevel = Results 2::Level
AND
Exams::RequiredResultNum ≤ Results 2::cResultNum
This allows each exam record to calculate a list of the students that are eligible to take that exam with the calculation:
List ( Results 2::StudentID )
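The matching rule behind that relationship can be written out in Python to check it against the sample rows. This is not FileMaker code; the dictionary keys and the exam record (subject 1, level 2, at least 85, taking B+ as 85 per the table above) are assumptions for illustration.

```python
# Sample rows from the Results table above.
results = [
    {"student_id": 3234, "subject_id": 1, "level": 2, "result_num": 95},
    {"student_id": 3234, "subject_id": 2, "level": 4, "result_num": 85},
    {"student_id": 4739, "subject_id": 1, "level": 4, "result_num": 75},
]

# Hypothetical exam record: requires subject 1 at level 2 with B+ (85) or higher.
exam = {"required_subject_id": 1, "required_level": 2, "required_result_num": 85}


def eligible_students(exam, results):
    """Apply the three relationship predicates and list the matching students."""
    return [
        r["student_id"]
        for r in results
        if r["subject_id"] == exam["required_subject_id"]
        and r["level"] == exam["required_level"]
        and r["result_num"] >= exam["required_result_num"]
    ]


def eligibility_flag(student_id, exam, results):
    """Return 1 if the student has a qualifying result for this exam, else 0."""
    return 1 if student_id in eligible_students(exam, results) else 0
```

Note that the flag is defined per exam, which is why a single field in the Students table can hold it only for one chosen exam.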
"I want to define a field in the 'students' table that contains a '1' if there exists an exam result belonging to the right student that meets these criteria and a '0' otherwise."
This request is unclear, because there are many exams a student may want to take, and a field in the Students table can calculate only one result.

You need to do a self-join on the table for the fields you want to check, for example:
Exam::Level = Exam2::Level
Exam::Student = Exam2::Student
For the "was passed" criterion, I think you could use an "If" in the calculation like this:
If ( Last(Exam2::Result) = "D" and ...(all the pass values) ; 1 ; 0 )
Edit:
It could be done with just the failing value, which I missed at first. It would look like this:
If ( Last(Exam2::Result) = "F" ; 0 ; 1 )
I hope this helps you.