In Neo4j, how to group by one return key instead of all non-aggregate keys in the RETURN?

Neo4j documentation mentions that:
4.3. Aggregating functions
To calculate aggregated data, Cypher offers aggregation, analogous to SQL’s GROUP BY.
Aggregating functions take a set of values and calculate an aggregated
value over them. Examples are avg() that calculates the average of
multiple numeric values, or min() that finds the smallest numeric or
string value in a set of values. When we say below that an aggregating
function operates on a set of values, we mean these to be the result
of the application of the inner expression (such as n.age) to all the
records within the same aggregation group.
Aggregation can be computed over all the matching subgraphs, or it can
be further divided by introducing grouping keys. These are
non-aggregate expressions, that are used to group the values going
into the aggregate functions.
Assume we have the following return statement:
RETURN n, count(*)
We have two return expressions: n, and count(*). The first, n, is not an aggregate function, and so it will be the grouping key. The
latter, count(*), is an aggregate expression. The matching subgraphs
will be divided into different buckets, depending on the grouping
key. The aggregate function will then be run on these buckets,
calculating an aggregate value per bucket.
I cannot figure out how to:
RETURN n, m, collect(n);
for example, and only use n as the grouping key, not both n and m.

This is not possible in Cypher, which groups implicitly: every non-aggregate expression in the RETURN clause becomes a grouping key, as you have learned from the documentation. It is quite similar to SQL, except that there you have to add the GROUP BY clause explicitly.
What you can do is use subqueries, or split the query into two parts with WITH, where you first aggregate the data and then iterate over each node again in the second part.
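A minimal sketch of that two-step approach, assuming a hypothetical pattern (n)-->(m): aggregate with n as the only grouping key, then UNWIND the collection to bring m back into scope.

MATCH (n)-->(m)
WITH n, collect(m) AS ms   // n is the only grouping key here
UNWIND ms AS m
RETURN n, m, ms;

Each returned row still carries n, m, and the per-n collection, but the aggregation was grouped on n alone.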

Related

What does the distinct on clause mean in cloud datastore and how does it effect the reads?

This is what the Cloud Datastore documentation says, but I'm having a hard time understanding what exactly it means:
A projection query that does not use the distinct on clause is a small operation and counts as only a single entity read for the query itself.
Grouping
Projection queries can use the distinct on clause to ensure that only the first result for each distinct combination of values for the specified properties will be returned. This will return only the first result for entities which have the same values for the properties that are being projected.
Let's say I have a table for questions and I only want to get the question text sorted by the created date. Would this be counted as a single read and the rest as small operations?
If your goal is just to project the date and text fields, you can create a composite index on those two fields. Since you are not trying to de-duplicate (so no distinct on), the query itself is a small operation, and all of the results together count as a single read.
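A minimal sketch with the Python client library, assuming a Question kind with created and text properties and a composite index over those two properties in index.yaml (all names here are illustrative):

from google.cloud import datastore

client = datastore.Client()

# Projection query: only the projected properties are returned,
# served from the composite index on (created, text).
query = client.query(kind="Question")
query.projection = ["created", "text"]
query.order = ["created"]

for entity in query.fetch():
    print(entity["created"], entity["text"])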

Order of results for `sort` using mongoose

If two documents have equal values for a field, what would be the order of results when sorting on that field? Random, or ordered by insertion date?
If two documents have equal values for the field you're sorting on, then MongoDB will return the results in the order they are found on disk (i.e. natural order).
From the MongoDB documentation:
natural order: The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
This may coincide with insertion date in some cases, but not all of the time (especially when you perform insertions and deletions on your collection), so you should assume that the ordering is effectively random.
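If you need the order to be deterministic, a common approach is to add a unique tiebreaker such as _id to the sort. A sketch in Mongoose, where the Item model and score field are hypothetical:

// Sort by score descending; ties are broken by _id, so the
// order is stable across repeated queries.
const results = await Item.find()
  .sort({ score: -1, _id: 1 })
  .exec();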

How to sort Array[Row] by given column index in Scala

How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect(), which gives me an Array[Row], but I want to sort it based on a given column index.
I have already implemented quick-sort logic and it works, but it involves too many loops and extra code.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it; once you collect, you lose the distributed (and parallel) computation. You can use the DataFrame's sort, for example, ascending order by column "col1":
val sorted = dataframe.sort(asc("col1"))
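If you do need to sort after collecting, Array's built-in sortBy keeps it to a single expression. A sketch assuming the column at index i holds an Int; use getString, getDouble, etc. to match the column's actual type:

import org.apache.spark.sql.Row

val rows: Array[Row] = dataframe.collect()
// In-memory sort by the value at column index i.
val sortedRows: Array[Row] = rows.sortBy(_.getInt(i))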

How to average millions of rows of NumberLong in Mongo?

I am trying to calculate the average of millions of records with NumberLong type in Mongo.
However, aggregation with $avg doesn't work because of the data size.
Is there a good approach to solving this?
You can use MapReduce for this.
Your map function would take each document and emit an object with two fields: one field value with the value you want to average and one field count with a value of 1.
Your reduce function would then sum up both the field count and the field value of all objects passed to it, returning one object representing how many documents were summarized and what their sum is.
Your finalize function would then divide the value by the count of the resulting object and return this number.
The second MapReduce example in the official documentation is very close to your use case; you should be able to use it as a reference. The only difference is that you want a single average, not separate ones for subsets of your collection, so you would replace the key with a constant value.
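A sketch in the mongo shell, assuming a collection named records with the numbers in a field named amount (both names are illustrative; NumberLong values are coerced to floating point by the shell's JavaScript, which is usually acceptable for an average):

db.records.mapReduce(
  // Map: a constant key puts every document into one group.
  function () { emit(1, { value: this.amount, count: 1 }); },
  // Reduce: sum the value and count fields of all objects in the group.
  function (key, objs) {
    var out = { value: 0, count: 0 };
    objs.forEach(function (o) { out.value += o.value; out.count += o.count; });
    return out;
  },
  {
    // Finalize: divide the summed value by the document count.
    finalize: function (key, out) { return out.value / out.count; },
    out: { inline: 1 }
  }
);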

Using Mongo: should we create an index tailored to each type of high-volume query?

We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, then because each query involves multiple arrays, it doesn't seem like we can create such an index, because of Mongo's restriction on indexing parallel arrays. How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
For indexes, after the first range in the query, the value of the additional index fields drops significantly.
Conceptually, I find it best to think of the additional fields in the index as pruning ever smaller sub-trees for the query. The first range chops off a large branch, the second a smaller one, the third smaller still, etc. My general rule of thumb is that only the first range from the query adds value to the index.
The caveat to that rule is that additional fields in the index can be useful for sorting the returned results.
For the first query I would create an index on the two array values and then whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (both lte and gte). The integer and float are hard to judge without knowing the domain.
If the second query's two additional attributes also use ranges and do not have a significantly higher exclusion value, then I would just work with the one index.
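As a sketch with hypothetical field names, and given the parallel-array restriction the question mentions (MongoDB will not index two array fields of the same document in one compound index), you would pair one array field with the most selective of the ranges:

// Compound index: one array field plus the most selective range field.
db.items.createIndex({ tagsA: 1, score: 1 })

// The query can still filter on the remaining attributes; those
// conditions are simply applied after the index narrows the candidates.
db.items.find({
  tagsA: "red",
  tagsB: "large",                          // not served by the index
  score: { $gte: 4.5 },
  count: { $gte: 10 },
  created: { $lte: ISODate("2024-01-01") }
})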
Rob.